Brazilian patent BR112018015913B1: method, implemented using a computer system comprising one or more processors and system memory, for

Abstract:
Methods are described for determining copy number variations (CNVs) known or suspected to be associated with a variety of medical conditions. In some embodiments, methods are provided for determining copy number variation of a fetus using maternal samples comprising both maternal and fetal cell-free DNA. In some embodiments, methods are provided for determining CNVs known or suspected to be associated with a variety of medical conditions. Some embodiments described herein provide methods for improving the sensitivity and/or specificity of sequence data analysis by deriving a fragment size parameter. In some implementations, information from fragments of different sizes is used to evaluate copy number variations. In some implementations, one or more t-statistics obtained from coverage information of the sequence of interest are used to evaluate copy number variations. In some implementations, one or more fetal fraction estimates are combined with one or more t-statistics to determine copy number variations.
Publication number: BR112018015913B1
Application number: R112018015913
Filing date: 2016-12-20
Publication date: 2019-12-03
Inventors: Barbacioru Catalin; I Chudova Darya; A Comstock David; Skvortsov Dimitri; Chen Gengxin; W Jones Keith; P Rava Richard; Duenwald Sven
Applicant: Verinata Health Inc
Main IPC class:

Description:

“METHOD, IMPLEMENTED USING A COMPUTER SYSTEM COMPRISING ONE OR MORE PROCESSORS AND SYSTEM MEMORY, FOR DETERMINING A COPY NUMBER VARIATION OF A NUCLEIC ACID SEQUENCE OF INTEREST, AND SYSTEM FOR EVALUATING THE COPY NUMBER OF A NUCLEIC ACID SEQUENCE OF INTEREST”
CROSS REFERENCE TO RELATED APPLICATIONS
[001] This application claims the benefit under 35 U.S.C. § 119(e) of US Provisional Patent Application No. 62/290,891, entitled USING CELL-FREE DNA FRAGMENT SIZE TO DETERMINE COPY NUMBER VARIATIONS, filed on February 3, 2016, and of US Patent Application No. 15/382,508, entitled USING CELL-FREE DNA FRAGMENT SIZE TO DETERMINE COPY NUMBER VARIATIONS, filed on December 16, 2016, which are incorporated herein by reference in their entirety for all purposes.
BACKGROUND
[002] One of the critical endeavors in human medical research is the discovery of genetic abnormalities that produce adverse health consequences. In many cases, specific genes and/or critical diagnostic markers have been identified in portions of the genome that are present at abnormal copy numbers. For example, in prenatal diagnosis, extra or missing copies of whole chromosomes are frequently occurring genetic lesions. In cancer, deletion or multiplication of copies of whole chromosomes or chromosomal segments, and higher-level amplifications of specific regions of the genome, are common occurrences.
[003] Most information about copy number variation (CNV) has been provided by cytogenetic resolution, which has permitted recognition of structural abnormalities. Conventional procedures for genetic screening and biological dosimetry have used invasive procedures, for example amniocentesis, cordocentesis, or chorionic villus sampling (CVS), to obtain cells for karyotype analysis. Recognizing the need for more rapid testing methods that do not require cell culture, fluorescence in situ hybridization (FISH), quantitative fluorescence PCR (QF-PCR), and array comparative genomic hybridization (array-CGH) have been developed as molecular-cytogenetic methods for the analysis of copy number variations.
[006] The advent of technologies that allow whole genomes to be sequenced in a relatively short time, together with the discovery of circulating cell-free DNA (cfDNA), has made it possible to compare genetic material originating from one chromosome with that of another without the risks associated with invasive sampling methods. This provides a tool for diagnosing various kinds of copy number variation in genetic sequences of interest.
[007] Limitations of existing methods in noninvasive prenatal diagnostics, which include insufficient sensitivity stemming from the limited levels of cfDNA and the sequencing bias of the technology stemming from the inherent nature of genomic information, underlie the continuing need for noninvasive methods that would provide any or all of the specificity, sensitivity, and applicability needed to reliably diagnose copy number changes in a variety of clinical settings. It has been shown that fetal cfDNA fragments are, on average, shorter than maternal cfDNA fragments in the plasma of pregnant women. This difference between maternal and fetal cfDNA is exploited in the implementations herein to determine CNV and/or fetal fraction. The embodiments described herein fulfill some of the above needs. Some embodiments may be implemented with PCR-free library preparation coupled with paired-end DNA sequencing. Some embodiments provide high analytical sensitivity and specificity for noninvasive prenatal diagnostics and for the diagnosis of a variety of diseases.
SUMMARY
[008] In some embodiments, methods are provided for determining copy number variation (CNV) of any fetal aneuploidy, and CNVs known or suspected to be associated with a variety of medical conditions. CNVs that can be determined according to the present method include trisomies and monosomies of any one or more of chromosomes 1 to 22, X and Y, other chromosomal polysomies, and deletions and/or duplications of segments of any one or more of the chromosomes. In some embodiments, the methods involve identifying CNVs of a nucleic acid sequence of interest, for example a clinically relevant sequence, in a test sample. The method assesses copy number variation of the specific sequence of interest.
[009] In some embodiments, the method is implemented in a computer system that includes one or more processors and system memory to evaluate the copy number of a nucleic acid sequence of interest in a test sample comprising nucleic acids of one or more genomes.
[0010] One aspect of the disclosure relates to a method for determining a copy number variation (CNV) of a nucleic acid sequence of interest in a test sample including cell-free nucleic acid fragments originating from two or more genomes. The method includes: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome including the sequence of interest, thereby providing test sequence tags, where the reference genome is divided into a plurality of bins; (c) determining fragment sizes of at least some of the cell-free nucleic acid fragments present in the test sample; (d) calculating coverage of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags aligning to the bin and (ii) normalizing the number of sequence tags aligning to the bin by accounting for bin-to-bin variations due to factors other than copy number variation; (e) determining a t-statistic for the sequence of interest using coverage of bins in the sequence of interest and coverage of bins in a reference region for the sequence of interest; and (f) determining a copy number variation in the sequence of interest using a likelihood ratio calculated from the t-statistic and information about the sizes of the cell-free nucleic acid fragments.
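Steps (d) and (e) above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the per-bin `expected` baseline used for normalization, and the choice of a Welch-style t statistic are hypothetical and are not taken from the specification.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def count_tags_per_bin(tag_starts, bin_size):
    """Count aligned tags per bin from 0-based start coordinates."""
    return Counter(start // bin_size for start in tag_starts)


def normalize_coverage(counts, expected):
    """Divide each bin count by an expected count for that bin.

    `expected` is a hypothetical per-bin baseline (for example, derived from
    unaffected training samples); it stands in for the normalization in (d)(ii).
    """
    return {b: counts.get(b, 0) / expected[b] for b in expected}


def t_statistic(interest, reference):
    """Welch-style t statistic comparing mean coverage of two groups of bins."""
    se = sqrt(variance(interest) / len(interest)
              + variance(reference) / len(reference))
    return (mean(interest) - mean(reference)) / se


counts = count_tags_per_bin([0, 5, 10, 25], bin_size=10)
coverage = normalize_coverage(counts, expected={0: 2, 1: 1, 2: 1})
# Normalized coverages of bins in the sequence of interest vs. a reference region.
t = t_statistic([1.05, 1.04, 1.06, 1.05], [1.00, 0.99, 1.01, 1.00])
```

A clearly positive `t` indicates that the bins of the sequence of interest are covered more heavily than the reference bins, which is the kind of signal step (f) then weighs together with fragment-size information.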
[0011] In some implementations, the method includes performing (d) and (e) twice, once for fragments in a first size domain and again for fragments in a second size domain. In some implementations, the first size domain includes cell-free nucleic acid fragments of substantially all sizes in the sample, and the second size domain includes only cell-free nucleic acid fragments smaller than a defined size. In some implementations, the second size domain includes only cell-free nucleic acid fragments smaller than about 150 base pairs. In some implementations, the likelihood ratio is calculated from a first t-statistic for the sequence of interest using sequence tags for fragments in a first size range and a second t-statistic for the sequence of interest using sequence tags for fragments in a second size range.
[0012] In some implementations, the likelihood ratio is calculated as a first likelihood that the test sample is an aneuploid sample over a second likelihood that the test sample is a euploid sample.
[0013] In some implementations, the likelihood ratio is calculated from one or more fetal fraction values in addition to the t-statistic and information about the sizes of the cell-free nucleic acid fragments.
[0014] In some implementations, the one or more fetal fraction values include a fetal fraction value calculated using information about the sizes of the cell-free nucleic acid fragments. In some implementations, the fetal fraction value is calculated by: obtaining a frequency distribution of the fragment sizes; and applying the frequency distribution to a model relating fetal fraction to fragment size frequency to obtain the fetal fraction value. In some implementations, the model relating fetal fraction to fragment size frequency includes a general linear model having a plurality of terms and coefficients for a plurality of fragment sizes.
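As a rough illustration of the size-based estimate, the sketch below builds a fragment-size frequency vector and feeds it to a linear model. The sizes, coefficients, and intercept are made-up placeholders; in practice such values would have to be fitted on training samples with known fetal fraction.

```python
from collections import Counter


def size_frequencies(fragment_lengths, sizes):
    """Relative frequency of each fragment size of interest in the sample."""
    counts = Counter(fragment_lengths)
    total = len(fragment_lengths)
    return [counts[s] / total for s in sizes]


def linear_model(features, coefficients, intercept):
    """Linear model: intercept plus one weighted term per feature."""
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Hypothetical inputs: four observed fragment lengths and two modeled sizes.
freqs = size_frequencies([100, 100, 150, 200], sizes=[100, 150])
fetal_fraction = linear_model(freqs, coefficients=[0.2, 0.4], intercept=0.01)
```

The same `linear_model` shape would apply to the coverage-based and 8-mer-based estimates described in the following paragraphs, with per-bin coverages or 8-mer frequencies as the feature vector.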
[0015] In some implementations, the one or more fetal fraction values include a fetal fraction value calculated using coverage information for the bins of the reference genome. In some implementations, the fetal fraction value is calculated by applying coverage values of a plurality of bins to a model relating fetal fraction to bin coverage to obtain the fetal fraction value. In some implementations, the model relating fetal fraction to bin coverage includes a general linear model having a plurality of terms and coefficients for a plurality of bins. In some implementations, the plurality of bins have high correlation between fetal fraction and coverage in training samples.
[0016] In some implementations, the one or more fetal fraction values include a fetal fraction value calculated using frequencies of a plurality of 8-mers found in the reads. In some implementations, the fetal fraction value is calculated by applying the frequencies of a plurality of 8-mers to a model relating fetal fraction to 8-mer frequency to obtain the fetal fraction value. In some implementations, the model relating fetal fraction to 8-mer frequency includes a general linear model having a plurality of terms and coefficients for a plurality of 8-mers. In some implementations, the plurality of 8-mers have high correlation between fetal fraction and 8-mer frequency.
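The 8-mer frequencies that serve as model inputs can be obtained with a simple sliding window. The sketch below is an assumption about how such counting might be done (overlapping windows, frequencies normalized over all observed 8-mers); the specification does not prescribe these details here.

```python
from collections import Counter


def kmer_frequencies(reads, k=8):
    """Relative frequency of every overlapping k-mer across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# A single 9-base read contains exactly two overlapping 8-mers.
freqs = kmer_frequencies(["ACGTACGTA"])
```

Reads shorter than `k` contribute nothing, so an input made up only of short reads yields an empty dictionary rather than an error.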
[0017] In some implementations, the one or more fetal fraction values include a fetal fraction value calculated using coverage information for the bins of a sex chromosome.
[0018] In some implementations, the likelihood ratio is calculated from a fetal fraction, a t-statistic of short fragments, and a t-statistic of all fragments, where the short fragments are cell-free nucleic acid fragments in a first size range smaller than a criterion size, and all fragments are cell-free nucleic acid fragments including the short fragments and fragments longer than the criterion size. In some implementations, the likelihood ratio is calculated as:
$$LR = \frac{\sum_{ff_{total}} q(ff_{total}) \, p_1(T_{short}, T_{all})}{p_0(T_{short}, T_{all})}$$
where $p_1$ represents the likelihood that the data come from a multivariate normal distribution representing a 3-copy or 1-copy model, $p_0$ represents the likelihood that the data come from a multivariate normal distribution representing a 2-copy model, $T_{short}$ and $T_{all}$ are T scores calculated from chromosomal coverage generated from short fragments and from all fragments, and $q(ff_{total})$ is a density distribution of the fetal fraction.
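To make the structure of this ratio concrete, here is a minimal sketch that sums an aneuploid-model density over a discrete fetal fraction distribution and divides by the euploid-model density. Several things are assumptions made only for illustration: the euploid model is centered at zero, the aneuploid model's mean is supplied by a hypothetical `mean_shift` function of fetal fraction, and both models share the same standard deviation and correlation.

```python
from math import exp, pi, sqrt


def bivariate_normal_pdf(x, y, mx, my, sx, sy, rho):
    """Density of a bivariate normal with means, std devs, and correlation rho."""
    zx = (x - mx) / sx
    zy = (y - my) / sy
    quad = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho ** 2)
    return exp(-quad / 2) / (2 * pi * sx * sy * sqrt(1 - rho ** 2))


def likelihood_ratio(t_short, t_all, ff_density, mean_shift, sd=1.0, rho=0.5):
    """Weighted aneuploid likelihood over euploid likelihood.

    ff_density: dict mapping fetal fraction -> weight q(ff), summing to 1.
    mean_shift: hypothetical function ff -> (expected T_short, expected T_all)
                under the aneuploid model.
    """
    p0 = bivariate_normal_pdf(t_short, t_all, 0.0, 0.0, sd, sd, rho)
    numerator = 0.0
    for ff, q in ff_density.items():
        m_short, m_all = mean_shift(ff)
        numerator += q * bivariate_normal_pdf(t_short, t_all, m_short, m_all, sd, sd, rho)
    return numerator / p0


def shift(ff):
    """Made-up linear relation between fetal fraction and expected T scores."""
    return (60 * ff, 50 * ff)


lr_affected = likelihood_ratio(6.0, 5.0, {0.1: 1.0}, shift)
lr_unaffected = likelihood_ratio(0.0, 0.0, {0.1: 1.0}, shift)
```

With these placeholder numbers, T scores near the aneuploid model's expected values give a ratio far above 1, while T scores near zero give a ratio far below 1.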
[0019] In some implementations, the likelihood ratio is calculated from one or more fetal fraction values in addition to the t-statistic and information about the sizes of the cell-free nucleic acid fragments.
[0020] In some implementations, the likelihood ratio is calculated for monosomy X, trisomy X, trisomy 13, trisomy 18, or trisomy 21.
[0021] In some implementations, normalizing the number of sequence tags includes: normalizing for the GC content of the sample, normalizing for a global wave profile of variation of a training set, and/or normalizing for one or more components obtained from a principal component analysis.
[0022] In some implementations, the sequence of interest is a human chromosome selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, chromosome X, and chromosome Y.
[0023] In some implementations, the reference region is all robust chromosomes, robust chromosomes not including the sequence of interest, at least one chromosome outside of the sequence of interest, and/or a subset of chromosomes selected from the robust chromosomes. In some implementations, the reference region includes robust chromosomes that have been determined to provide the best signal detection ability for a set of training samples.
[0024] In some implementations, the method also includes calculating values of a size parameter for the bins by, for each bin: (i) determining a value of the size parameter from the sizes of cell-free nucleic acid fragments in the bin and (ii) normalizing the value of the size parameter by accounting for bin-to-bin variations due to factors other than copy number variation. The method also includes determining a size-based t-statistic for the sequence of interest using values of the size parameter of bins in the sequence of interest and values of the size parameter of bins in the reference region for the sequence of interest. In some implementations, the likelihood ratio of (f) is calculated from the t-statistic and the size-based t-statistic. In some implementations, the likelihood ratio of (f) is calculated from the size-based t-statistic and a fetal fraction.
[0025] In some implementations, the method also includes comparing the likelihood ratio to a call criterion to determine a copy number variation in the sequence of interest. In some implementations, the likelihood ratio is converted to a log likelihood ratio before being compared to the call criterion. In some implementations, the call criterion is obtained by applying different criteria to a set of training samples and selecting a criterion that provides a defined sensitivity and a defined selectivity.
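The comparison against a call criterion, and the selection of that criterion from training samples, might look like the sketch below. It assumes that "selectivity" is measured as specificity (the fraction of unaffected training samples not called) and that the lowest candidate meeting both targets is chosen; both are illustrative assumptions, as are all names and numbers.

```python
from math import e, log


def call_cnv(ratio, criterion):
    """Call a CNV when the log of the ratio exceeds the criterion."""
    return log(ratio) > criterion


def choose_criterion(log_ratios, affected, candidates, min_sensitivity, min_specificity):
    """Return the lowest candidate criterion meeting both targets, else None."""
    positives = [r for r, a in zip(log_ratios, affected) if a]
    negatives = [r for r, a in zip(log_ratios, affected) if not a]
    if not positives or not negatives:
        return None
    for c in sorted(candidates):
        sensitivity = sum(r > c for r in positives) / len(positives)
        specificity = sum(r <= c for r in negatives) / len(negatives)
        if sensitivity >= min_sensitivity and specificity >= min_specificity:
            return c
    return None


# Hypothetical training set: log ratios with known affected status.
criterion = choose_criterion(
    log_ratios=[-2.0, -1.0, 0.5, 3.0, 4.0],
    affected=[False, False, False, True, True],
    candidates=[0, 1, 2],
    min_sensitivity=1.0,
    min_specificity=1.0,
)
is_called = call_cnv(e ** 2, criterion)
```

Here a criterion of 0 would misclassify the unaffected sample with a log ratio of 0.5, so the search moves on to 1, which separates the two groups.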
[0026] In some implementations, the method also includes obtaining a plurality of likelihood ratios and applying the plurality of likelihood ratios to a decision tree to determine a ploidy case for the sample.
[0027] In some implementations, the method also includes obtaining a plurality of likelihood ratios and one or more coverage values of the sequence of interest, and applying the plurality of likelihood ratios and the one or more coverage values of the sequence of interest to a decision tree to determine a ploidy case for the sample.
[0028] Another aspect of the disclosure relates to a method for determining a copy number variation (CNV) of a nucleic acid sequence of interest in a test sample including cell-free nucleic acid fragments originating from two or more genomes. The method includes: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome including the sequence of interest, thereby providing test sequence tags, where the reference genome is divided into a plurality of bins; (c) calculating coverage of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags aligning to the bin and (ii) normalizing the number of sequence tags aligning to the bin by accounting for bin-to-bin variations due to factors other than copy number variation. The method also includes: (d) determining a t-statistic for the sequence of interest using coverage of bins in the sequence of interest and coverage of bins in a reference region for the sequence of interest; (e) estimating one or more fetal fraction values of the cell-free nucleic acid fragments in the test sample; and (f) determining a copy number variation in the sequence of interest using the t-statistic and the one or more fetal fraction values.
[0029] In some implementations, (f) includes calculating a likelihood ratio from the t-statistic and the one or more fetal fraction values. In some implementations, the likelihood ratio is calculated for monosomy X, trisomy X, trisomy 13, trisomy 18, or trisomy 21.
[0030] In some implementations, normalizing the number of sequence tags includes: normalizing for the GC content of the sample, normalizing for a global wave profile of variation of a training set, and/or normalizing for one or more components obtained from a principal component analysis.
[0031] In some implementations, the sequence of interest is a human chromosome selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, chromosome X, and chromosome Y.
[0032] Another aspect of the disclosure relates to a method for determining a copy number variation (CNV) of a nucleic acid sequence of interest in a test sample including cell-free nucleic acid fragments originating from two or more genomes. The method includes: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome including the sequence of interest, thereby providing test sequence tags, where the reference genome is divided into a plurality of bins; (c) determining fragment sizes of the cell-free nucleic acid fragments existing in the test sample; (d) calculating coverage of the sequence tags for the bins of the reference genome using sequence tags for the cell-free nucleic acid fragments having sizes in a first size domain; (e) calculating coverage of the sequence tags for the bins of the reference genome using sequence tags for the cell-free nucleic acid fragments having sizes in a second size domain, where the second size domain is different from the first size domain; (f) calculating size characteristics for the bins of the reference genome using the fragment sizes determined in (c); and (g) determining a copy number variation in the sequence of interest using the coverages calculated in (e) and the size characteristics calculated in (f).
[0033] In some implementations, the first size domain includes cell-free nucleic acid fragments of substantially all sizes in the sample, and the second size domain includes only cell-free nucleic acid fragments smaller than a defined size. In some implementations, the second size domain includes only cell-free nucleic acid fragments smaller than about 150 base pairs.
[0034] In some implementations, the sequence of interest is a human chromosome selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, X chromosome and Y chromosome.
[0035] In some implementations, (g) includes calculating a t-statistic for the sequence of interest using the coverage of the bins in the sequence of interest calculated in (d) and/or (e). In some implementations, calculating the t-statistic for the sequence of interest includes using the coverage of the bins in the sequence of interest and the coverage of bins in a reference region for the sequence of interest.
[0036] In some implementations, (g) includes calculating a t-statistic for the sequence of interest using the size characteristics of the bins in the sequence of interest calculated in (f). In some implementations, calculating the t-statistic for the sequence of interest includes using the size characteristics of the bins in the sequence of interest and the size characteristics of the bins in a reference region for the sequence of interest.
[0037] In some implementations, the size characteristic for a bin includes a ratio of fragments of sizes smaller than a defined value to the total fragments in the bin.
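This size characteristic is straightforward to compute once each fragment has a start coordinate and a length. The sketch below assumes fragments are given as `(start, length)` pairs, assigns each fragment to the bin containing its start, and uses a 150 base pair cutoff as the defined value; all three choices are illustrative.

```python
from collections import defaultdict


def short_fraction_per_bin(fragments, bin_size, cutoff=150):
    """Fraction of fragments shorter than `cutoff` in each bin.

    fragments: iterable of (start, length) pairs for aligned fragments.
    """
    total = defaultdict(int)
    short = defaultdict(int)
    for start, length in fragments:
        b = start // bin_size
        total[b] += 1
        if length < cutoff:
            short[b] += 1
    return {b: short[b] / total[b] for b in total}


# Two fragments fall in bin 0 (one short) and two in bin 1 (both short).
fractions = short_fraction_per_bin(
    [(0, 140), (10, 170), (120, 100), (130, 149)], bin_size=100
)
```

Only bins that contain at least one fragment appear in the result, so no division by zero can occur.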
[0038] In some implementations, (g) includes calculating a likelihood ratio from the t-statistic.
[0039] In some implementations, (g) includes calculating a likelihood ratio from a first t-statistic for the sequence of interest using the coverage calculated in (d) and a second t-statistic for the sequence of interest using the coverage calculated in (e).
[0040] In some implementations, (g) includes calculating a likelihood ratio from a first t-statistic for the sequence of interest using the coverage calculated in (d), a second t-statistic for the sequence of interest using the coverage calculated in (e), and a third t-statistic for the sequence of interest using the size characteristics calculated in (f).
[0041] In some implementations, the likelihood ratio is calculated from one or more fetal fraction values in addition to at least the first and second t-statistics. In some implementations, the method also includes calculating the one or more fetal fraction values using information about the sizes of the cell-free nucleic acid fragments.
[0042] In some implementations, the method also includes calculating the one or more fetal fraction values using coverage information for the bins of the reference genome. In some implementations, the one or more fetal fraction values include a fetal fraction value calculated using coverage information for the bins of a sex chromosome. In some implementations, the likelihood ratio is calculated for monosomy X, trisomy X, trisomy 13, trisomy 18, or trisomy 21.
[0043] In some implementations, (d) and/or (e) include: (i) determining a number of sequence tags aligning to the bin and (ii) normalizing the number of sequence tags aligning to the bin by accounting for bin-to-bin variations due to factors other than copy number variation. In some implementations, normalizing the number of sequence tags includes: normalizing for the GC content of the sample, normalizing for a global wave profile of variation of a training set, and/or normalizing for one or more components obtained from a principal component analysis.
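One simple way to picture the GC-content part of this normalization is to divide each bin's coverage by the median coverage of bins with similar GC content. The sketch below is a generic stratified-median approach chosen for illustration; it is not claimed to be the procedure used in the specification, and the 0.05 GC step is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(coverage, gc_content, step=0.05):
    """Divide each bin's coverage by the median coverage of its GC stratum."""
    keys = [round(g / step) for g in gc_content]
    strata = defaultdict(list)
    for c, k in zip(coverage, keys):
        strata[k].append(c)
    medians = {k: median(values) for k, values in strata.items()}
    return [c / medians[k] if medians[k] else 0.0 for c, k in zip(coverage, keys)]


# Two low-GC bins with high coverage and two high-GC bins with low coverage.
normalized = gc_normalize([100, 110, 50, 60], [0.40, 0.41, 0.60, 0.61])
```

After this step both strata are centered near 1, so the difference in raw coverage that tracked GC content no longer dominates the comparison between bins.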
[0044] In some implementations, (f) includes calculating values of a size parameter for the bins by, for each bin: (i) determining a value of the size parameter from the sizes of cell-free nucleic acid fragments in the bin and (ii) normalizing the value of the size parameter by accounting for bin-to-bin variations due to factors other than copy number variation.
[0045] Another aspect of the disclosure relates to a system for evaluating the copy number of a nucleic acid sequence of interest in a test sample. The system includes: a sequencer for receiving nucleic acid fragments from the test sample and providing nucleic acid sequence information of the test sample; a processor; and one or more computer-readable storage media having stored thereon instructions for execution on said processor. The instructions include instructions for: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome including the sequence of interest, thereby providing test sequence tags, where the reference genome is divided into a plurality of bins; (c) determining fragment sizes of at least some of the cell-free nucleic acid fragments present in the test sample; and (d) calculating coverage of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags aligning to the bin and (ii) normalizing the number of sequence tags aligning to the bin by accounting for bin-to-bin variations due to factors other than copy number variation. The instructions also include instructions for: (e) determining a t-statistic for the sequence of interest using coverage of bins in the sequence of interest and coverage of bins in a reference region for the sequence of interest; and (f) determining a copy number variation in the sequence of interest using a likelihood ratio calculated from the t-statistic and information about the sizes of the cell-free nucleic acid fragments.
[0046] In some implementations, the system is configured to perform any of the methods described above.
[0047] An additional aspect of the disclosure relates to a computer program product including one or more computer-readable non-transitory storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement any of the above methods.
[0048] Although the examples herein concern humans and the language is primarily directed to human concerns, the concepts described herein are applicable to genomes from any plant or animal. These and other objects and features of the present disclosure will become more fully apparent from the following description and appended claims, or may be learned by practice of the disclosure as set forth hereinafter.
INCORPORATION BY REFERENCE
[0049] All patents, patent applications, and other publications referred to herein, including all sequences disclosed within these references, are expressly incorporated by reference, to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. All documents cited are, in relevant part, incorporated herein by reference in their entireties for the purposes indicated by the context of their citation herein. However, the citation of any document is not to be construed as an admission that it is prior art with respect to the present disclosure.
BRIEF DESCRIPTION OF THE FIGURES
[0050] Figure 1 is a flow chart of a method 100 for determining the presence or absence of a variation in the number of copies in a test sample comprising a mixture of nucleic acids.
[0051] Figure 2A schematically illustrates how paired-end sequencing may be used to determine both fragment size and sequence coverage.
[0052] Figure 2B shows a flowchart of a process for using size-based coverage to determine a copy number variation of a nucleic acid sequence of interest in a test sample.
[0053] Figure 2C represents a flowchart of a process to determine the fragment size parameter for a nucleic acid sequence of interest used for the evaluation of the number of copies.
[0054] Figure 2D shows a flowchart of two overlapping passes of the workflow.
[0055] Figure 2E shows a flow chart of a three-step process to evaluate the number of copies.
[0056] Figure 2F shows implementations that apply a t-statistic to copy number analysis to improve the accuracy of the analysis.
[0057] Figure 2G shows an example process for determining fetal fraction from coverage information according to some implementations of the disclosure.
[0058] Figure 2H shows a process for determining fetal fraction from size distribution information according to some implementations.
[0059] Figure 2I shows an example process for determining fetal fraction from 8-mer frequency information according to some implementations of the disclosure.
[0060] Figure 2J shows a workflow for processing sequence read information, which can be used to obtain fetal fraction estimates.
[0061] Figure 3A shows a flowchart of an example of a process for reducing noise in sequence data from a test sample.
[0062] Figures 3B to 3K present analyses of data obtained at various stages of the process depicted in Figure 3A.
[0063] Figure 4A shows a flow chart of a process for creating a sequence mask to reduce noise in the sequence data.
[0064] Figure 4B shows that the MapQ score has a stronger monotonic correlation with the CV of normalized coverage quantities.
[0065] Figure 5 is a block diagram of a dispersed system for processing a test sample and ultimately making a diagnosis.
[0066] Figure 6 illustrates schematically how the different operations in the processing of test samples can be grouped to be controlled by different elements of a system.
[0067] Figures 7A and 7B show electropherograms from a cfDNA sequencing library prepared according to the abbreviated protocol described in Example 1a (Fig. 7A) and the protocol described in Example 1b (Fig. 7B).
[0068] Figure 8 shows the general workflow and timeline for a new version of NIPT compared to standard laboratory workflow.
[0069] Figure 9 shows sequencing library yield as a function of extracted cfDNA input, indicating a strong linear correlation between library concentration and input concentration, with high conversion efficiency.
[0070] Figure 10 shows the fragment size distribution of cfDNA as measured from 324 pregnancy samples with a male fetus.
[0071] Figure 11 shows the relative fetal fraction from total counts of mapped paired-end reads compared with counts of paired-end reads shorter than 150 base pairs.
[0072] Figure 12 shows the combined t-statistic aneuploidy score for detecting trisomy 21 samples for (A) total fragment counts; (B) short fragment counts (<150 base pairs) only; (C) fraction of short fragments (counts between 80 and 150 base pairs / counts <250 base pairs); (D) combined t-statistics of (B) and (C); and (E) results for the same samples obtained using the Illumina Redwood City CLIA laboratory process with an average of 16 M counts/sample.
[0073] Figure 13 shows the estimated fetal fractions of the selected bins versus those measured with normalized chromosome values (REF) for the X chromosome. Set 1 was used to calibrate the fetal fraction value and an independent set 2 to test the correlation.
DETAILED DESCRIPTION
Definitions
[0074] Unless otherwise indicated, the practice of the method and system described herein involves conventional techniques and apparatus commonly used in the fields of molecular biology, microbiology, protein purification, protein engineering, protein and DNA sequencing, and recombinant DNA, which are within the skill of the art. Such techniques and apparatus are known to those of skill in the art and are described in numerous texts and reference works (see, for example, Sambrook et al., “Molecular Cloning: A Laboratory Manual”, Third Edition (Cold Spring Harbor), [2001]; and Ausubel et al., “Current Protocols in Molecular Biology” [1987]).
[0075] The numerical ranges are inclusive of the numbers defining the range. It is intended that each maximum numerical limitation given throughout this specification includes each lower numerical limitation, as if such lower numerical limitations were expressly written here. Each minimum numerical limitation given throughout this specification will include each higher numerical limitation, as if such higher numerical limitations were expressly written here. Each numerical range given throughout this specification will include each narrower numerical range that falls within such a wider numerical range, as if such narrower numerical ranges were all expressly written here.
[0076] The headings provided here are not intended to limit the description.
[0077] Unless otherwise defined here, all the technical and scientific terms used herein have the same meaning as usually understood by a person of ordinary skill in the art. Several scientific dictionaries that include the terms included here are well known and available to those in the art. Although any methods and materials similar or equivalent to those described herein find use in practice or testing the modalities described herein, some methods and materials are described.
[0078] The terms defined immediately below are more fully described by reference to the specification as a whole. It should be understood that this description is not limited to the particular methodology, protocols and reagents described, as these may vary, depending on the context in which they are used by those of skill in the art. As used herein, the singular terms "a", "an" and "the" include the plural reference unless the context clearly indicates otherwise.
[0079] Unless otherwise indicated, nucleic acids are written from left to right in the 5' to 3' orientation and amino acid sequences are written from left to right in the amino to carboxy orientation, respectively.
[0080] The term "parameter" is used here to represent a physical feature whose value or other characteristic has an impact on a relevant condition such as variation in the number of copies. In some cases, the term parameter is used with reference to a variable that affects the output of a mathematical relationship or model, which variable can be an independent variable (that is, an input to the model) or an intermediate variable based on one or more independent variables. Depending on the scope of a model, an output of one model can become an input of another model, thereby becoming a parameter of the other model.
[0081] The term "fragment size parameter" refers to a parameter that relates to the size or length of a fragment or a collection of fragments such as nucleic acid fragments; for example, cfDNA fragments obtained from a body fluid. As used herein, a parameter is “biased toward a fragment size or size range” when: 1) the parameter is favorably weighted for the fragment size or size range, for example, a count weighted more heavily when associated with fragments of that size or size range than with fragments of other sizes or ranges; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size or size range, for example, a ratio obtained from a count weighted more heavily when associated with fragments of that size or size range. A fragment size or size range can be a characteristic of a genome or a portion thereof when the genome produces nucleic acid fragments enriched in, or having a higher concentration of, that size or size range relative to the nucleic acid fragments of another genome or another portion of the same genome.
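By way of illustration only, one fragment size parameter of the kind defined above could be computed as follows. This is a minimal sketch, not the claimed method: the function name `short_fragment_fraction` and its arguments are hypothetical, the size bounds are taken from the description of panel (C) of Figure 12 (counts between 80 and 150 base pairs over counts below 250 base pairs), and treating the 80 and 150 bounds as inclusive is an assumption.

```python
from typing import Iterable


def short_fragment_fraction(
    fragment_lengths: Iterable[int],
    short_min: int = 80,
    short_max: int = 150,
    denom_max: int = 250,
) -> float:
    """Fraction of fragments with short_min <= length <= short_max
    among all fragments with length < denom_max.

    The inclusive bounds are an assumption made for this sketch.
    """
    lengths = list(fragment_lengths)
    short = sum(1 for n in lengths if short_min <= n <= short_max)
    total = sum(1 for n in lengths if n < denom_max)
    if total == 0:
        raise ValueError("no fragments shorter than denom_max")
    return short / total


# Hypothetical fragment lengths in base pairs.
example_lengths = [100, 120, 160, 200, 300, 70]
example_fraction = short_fragment_fraction(example_lengths)
```

In this hypothetical input, two fragments fall in the 80 to 150 range and five are shorter than 250, so the parameter is a ratio biased toward the short size range.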
[0082] The term "weighting" refers to the modification of a quantity such as a parameter or variable using one or more values or functions, which are considered the "weight". In certain modalities, the parameter or variable is multiplied by the weight. In other modalities, the parameter or variable is modified exponentially. In some embodiments, the function can be a linear or non-linear function. Examples of applicable non-linear functions include, but are not limited to, Heaviside step functions, box-car functions, staircase functions or sigmoidal functions. Weighting an original parameter or variable can systematically increase or decrease the value of the weighted variable. In several modalities, weighting can result in positive, non-negative or negative values.
[0083] The term "copy number variation" here refers to the variation in the number of copies of a nucleic acid sequence present in a test sample compared to the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1 kb or greater. In some cases, the nucleic acid sequence is an entire chromosome or a significant portion of it. A “copy number variant” refers to a nucleic acid sequence in which copy number differences are found by comparing a nucleic acid sequence of interest in the test sample to an expected level of the nucleic acid sequence of interest. For example, the level of the nucleic acid sequence of interest in the test sample is compared to that in a qualified sample. Copy number variants / variations include deletions, including microdeletions, insertions, including microinsertions, duplications, multiplications, and translocations. CNVs include chromosomal aneuploidies and partial aneuploidies.
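The non-linear weighting functions named in paragraph [0082] can be sketched as below. This is an illustrative sketch under stated assumptions: the function names, the cutoffs (80 and 150 base pairs) and the sigmoid scale are hypothetical choices, and the step function is oriented so that shorter fragments receive the larger weight.

```python
import math
from typing import Callable, Iterable


def step_weight(size: float, cutoff: float = 150.0) -> float:
    """Heaviside-type step: weight 1 below the cutoff, 0 at or above it."""
    return 1.0 if size < cutoff else 0.0


def boxcar_weight(size: float, low: float = 80.0, high: float = 150.0) -> float:
    """Box-car: weight 1 inside [low, high], 0 elsewhere."""
    return 1.0 if low <= size <= high else 0.0


def sigmoid_weight(size: float, midpoint: float = 150.0, scale: float = 10.0) -> float:
    """Smooth weight near 1 for short fragments and near 0 for long ones.

    Written in a numerically stable form so large sizes do not overflow.
    """
    z = (size - midpoint) / scale
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def weighted_count(sizes: Iterable[float], weight: Callable[[float], float]) -> float:
    """Sum of per-fragment weights, i.e. a count weighted by fragment size."""
    return sum(weight(s) for s in sizes)


sizes = [100, 140, 160, 200]  # hypothetical fragment sizes in base pairs
step_total = weighted_count(sizes, step_weight)
boxcar_total = weighted_count(sizes, boxcar_weight)
sigmoid_total = weighted_count(sizes, sigmoid_weight)
```

Multiplying each fragment's unit contribution by its weight before summing is one way to obtain a count that is weighted toward a size range; other weighting schemes are equally consistent with the definition.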
[0084] The term "aneuploidy" here refers to an imbalance of genetic material caused by a loss or gain of an entire chromosome or part of a chromosome.
[0085] The terms "chromosomal aneuploidy" and "complete chromosomal aneuploidy" here refer to an imbalance of genetic material caused by a loss or gain of an entire chromosome and includes germline aneuploidy and mosaic aneuploidy.
[0086] The terms "partial aneuploidy" and "partial chromosomal aneuploidy" here refer to an imbalance of genetic material caused by a loss or gain of part of a chromosome, for example, partial monosomy and partial trisomy and encompasses imbalances resulting from translocations, deletions and insertions.
[0087] The term “plurality” refers to more than one element. For example, the term is used here in reference to a number of nucleic acid molecules or sequence labels that is sufficient to identify significant differences in copy number variations in test samples and qualified samples using the methods described herein. In some embodiments, at least about 3 x 10^6 sequence labels of between about 20 and 40 base pairs are obtained for each test sample. In some embodiments, each test sample provides data for at least about 5 x 10^6, 8 x 10^6, 10 x 10^6, 15 x 10^6, 20 x 10^6, 30 x 10^6, 40 x 10^6 or 50 x 10^6 sequence labels, each sequence label comprising between about 20 and 40 bp.
[0088] The term "paired end readings" refers to readings from paired end sequencing, which obtains one reading from each end of a nucleic acid fragment. Paired end sequencing may involve fragmenting polynucleotide strands into short sequences called inserts. Fragmentation is optional or unnecessary for relatively short polynucleotides such as cell-free DNA molecules.
[0089] The terms "polynucleotide", "nucleic acid" and "nucleic acid molecules" are used interchangeably and refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3' position of the pentose of one nucleotide is joined by a phosphodiester group to the 5' position of the pentose of the next. The nucleotides include sequences of any form of nucleic acid, including, but not limited to, RNA and DNA molecules such as cfDNA molecules. The term "polynucleotide" includes, without limitation, single- and double-stranded polynucleotides.
[0090] The term "test sample" here refers to a sample, typically derived from a biological fluid, cell, tissue, organ or organism, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence that is to be screened for variation in the number of copies. In certain embodiments, the sample comprises at least one nucleic acid sequence whose copy number is suspected to have varied. Such samples include, but are not limited to, saliva / oral fluid, amniotic fluid, blood, a blood fraction, biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid and the like. Although the sample is often taken from a human subject (e.g., a patient), the tests can be used to detect copy number variations (CNVs) in samples from any animal, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample can be used directly as obtained from the biological source or following a pre-treatment to modify the character of the sample. For example, such a pre-treatment may include preparing plasma from blood, diluting viscous fluids, and so on. Pre-treatment methods can also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysis, etc. If such pre-treatment methods are used with respect to the sample, they are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes in a concentration proportional to that in an untreated test sample (that is, a sample that is not subjected to any such pre-treatment method(s)). Such “treated” or “processed” samples are still considered to be “biological test samples” with respect to the methods described here.
[0091] The term “qualified sample” or “unaffected sample” here refers to a sample comprising a mixture of nucleic acids that are present in a known copy number, with which the nucleic acids in a test sample are to be compared, and it is a sample that is normal, that is, non-aneuploid, for the nucleic acid sequence of interest. In some embodiments, qualified samples are used as unaffected training samples from a training set to derive sequence masks or sequence profiles. In certain embodiments, qualified samples are used to identify one or more normalizing chromosomes or segments for a chromosome under consideration. For example, qualified samples can be used to identify a normalizing chromosome for chromosome 21. In such a case, the qualified sample is a sample that is not a trisomy 21 sample. Another example involves using only females as qualified samples for chromosome X. Qualified samples can also be used for other purposes such as determining thresholds for calling affected samples, identifying thresholds for defining mask regions in a reference sequence, determining expected coverage amounts for different regions of a genome and the like.
[0092] The term “training set” here refers to a set of training samples that can comprise affected and / or unaffected samples and are used to develop a model for analyzing test samples. In some modalities, the training set includes unaffected samples. In these modalities, the thresholds for determining CNV are established using training sets of samples that are unaffected by the copy number variation of interest. Unaffected samples in a training set can be used as qualified samples to identify normalizing sequences, for example, normalizing chromosomes, and the chromosome doses of unaffected samples are used to set the thresholds for each of the sequences, for example, chromosomes, of interest. In some modalities, the training set includes affected samples. The affected samples in a training set can be used to verify that affected test samples can be easily differentiated from unaffected samples.
[0093] A training set is also a statistical sample of a population of interest, which statistical sample is not to be confused with a biological sample. A statistical sample often comprises multiple individuals, data from which are used to determine one or more quantitative values of interest generalizable to the population. The statistical sample is a subset of individuals in the population of interest. Individuals can be people, animals, tissues, cells, other biological samples (that is, a statistical sample can include multiple biological samples) and other individual entities providing data points for statistical analysis.
[0094] Usually, a training set is used in conjunction with a validation set. The term “validation set” is used to refer to a set of individuals in a statistical sample, data from which are used to validate or evaluate the quantitative values of interest determined using a training set. In some modalities, for example, a training set provides data for calculating a mask for a reference sequence, while a validation set provides data for evaluating the mask's validity or effectiveness.
[0095] The term “copy number evaluation” is used here in reference to the statistical evaluation of the status of a genetic sequence related to the copy number of the sequence. For example, in some modalities, the evaluation involves determining the presence or absence of a genetic sequence. In some modalities, the evaluation comprises the determination of partial or complete aneuploidy of a genetic sequence. In other modalities, the evaluation comprises discrimination between two or more samples based on the number of copies of a genetic sequence. In some modalities, the evaluation comprises statistical analyses, for example, normalization and comparison, based on the copy number of the genetic sequence.
[0096] The term "qualified nucleic acid" is used interchangeably with "qualified sequence", which is a sequence against which the amount of a sequence or nucleic acid of interest is compared. A qualified sequence is present in a biological sample preferably in a known representation, that is, the amount of a qualified sequence is known. Generally, a qualified sequence is the sequence present in a “qualified sample”. A "qualified sequence of interest" is a qualified sequence for which the amount is known in a qualified sample and is a sequence that is associated with a difference in a sequence of interest between a control subject and an individual with a medical condition.
[0097] The terms "sequence of interest" or "nucleic acid sequence of interest" here refer to a nucleic acid sequence that is associated with a difference in sequence representation between healthy and diseased individuals. A sequence of interest can be a sequence on a chromosome that is misrepresented, that is, over- or under-represented, in a disease or genetic condition. A sequence of interest can be a portion of a chromosome, that is, a chromosome segment, or an entire chromosome. For example, a sequence of interest can be a chromosome that is over-represented in an aneuploidy condition or a gene that encodes a tumor suppressor that is under-represented in cancer. The sequences of interest include sequences that are over- or under-represented in the total population or a subpopulation of a subject's cells. A “qualified sequence of interest” is a sequence of interest in a qualified sample. A “test sequence of interest” is a sequence of interest in a test sample.
[0098] The term "normalizing sequence" here refers to a sequence that is used to normalize the number of sequence labels mapped to a sequence of interest associated with the normalizing sequence. In some embodiments, a normalizing sequence comprises a robust chromosome. A "robust chromosome" is one that is unlikely to be aneuploid. In some cases involving human chromosomes, a robust chromosome is any chromosome other than the X chromosome, Y chromosome, chromosome 13, chromosome 18 and chromosome 21. In some embodiments, the normalizing sequence displays a variability in the number of sequence labels that are mapped to it among samples and sequencing runs that approximates the variability of the sequence of interest for which it is used as a normalizing parameter. The normalizing sequence can differentiate an affected sample from one or more unaffected samples. In some implementations, the normalizing sequence best or effectively differentiates, when compared to other potential normalizing sequences such as other chromosomes, an affected sample from one or more unaffected samples. In some embodiments, the variability of the normalizing sequence is calculated as the variability in the chromosome dose for the sequence of interest across samples and sequencing runs. In some modalities, normalizing sequences are identified in a set of unaffected samples.
[0099] A "normalizing chromosome" or "normalizing chromosome sequence" is an example of a "normalizing sequence". A "normalizing chromosome sequence" can be composed of a single chromosome or a group of chromosomes. In some embodiments, a normalizing sequence comprises two or more robust chromosomes. In certain embodiments, the robust chromosomes are all chromosomes other than chromosomes X, Y, 13, 18 and 21. A "normalizing segment" is another example of a "normalizing sequence". A "normalizing segment sequence" can be made up of a single segment of a chromosome or it can be made up of two or more segments of the same or different chromosomes. In certain modalities, a normalizing sequence is intended to normalize for variability such as process-related variability, interchromosomal (intra-run) variability and inter-sequencing (inter-run) variability.
[00100] The term "differentiability" here refers to a characteristic of a normalizing chromosome that allows one to distinguish one or more unaffected samples, that is, normal samples, from one or more affected samples, that is, aneuploid samples. A normalizing chromosome demonstrating the greatest “differentiability” is a chromosome or group of chromosomes that provides the greatest statistical difference between the distribution of chromosome doses for a chromosome of interest in a set of qualified samples and the chromosome dose for the same chromosome of interest in the corresponding chromosome in one or more affected samples.
[00101] The term "variability" here refers to another characteristic of a normalizing chromosome that allows one to distinguish one or more unaffected samples, that is, normal samples, from one or more affected samples, that is, aneuploid samples. The variability of a normalizing chromosome, which is measured in a set of qualified samples, refers to the variability in the number of sequence labels that are mapped to it that approximates the variability in the number of sequence labels that are mapped to a chromosome of interest for which it serves as a normalizing parameter.
[00102] The term "sequence label density" here refers to the number of sequence readings that are mapped to a reference genome sequence; for example, the sequence label density for chromosome 21 is the number of sequence readings generated by the sequencing method that are mapped to chromosome 21 of the reference genome.
[00103] The term "sequence label density ratio" here refers to the ratio of the number of sequence labels that are mapped to a chromosome of the reference genome, for example, chromosome 21, to the length of the chromosome of the genome of reference.
[00104] The term “sequence dose” here refers to a parameter that relates the number of sequence labels or another parameter identified for a sequence of interest to the number of sequence labels or the other parameter identified for the normalizing sequence. In some cases, the sequence dose is the ratio of the sequence label coverage or the other parameter for a sequence of interest to the sequence label coverage or the other parameter for a normalizing sequence. In some cases, the sequence dose refers to a parameter that relates the sequence label density of a sequence of interest to the sequence label density of a normalizing sequence. A “test sequence dose” is a parameter that relates the sequence label density or the other parameter of a sequence of interest, for example, chromosome 21, to that of a normalizing sequence, for example, chromosome 9, determined in a test sample. Similarly, a “qualified sequence dose” is a parameter that relates the sequence label density or the other parameter of a sequence of interest to that of a normalizing sequence determined in a qualified sample.
[00105] The term "coverage" refers to the abundance of sequence labels mapped to a defined sequence. The coverage can be quantitatively indicated by the sequence label density (or count of sequence labels), the sequence label density ratio, normalized coverage amounts, adjusted coverage values, etc.
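As a simple illustration of the ratio described in this definition, a sequence dose can be computed from per-chromosome label counts. The sketch below is an assumption-laden example rather than the patented procedure: the function name `sequence_dose`, the dictionary layout and the counts are all hypothetical, and summing the counts of several normalizing sequences is one possible way to handle a group of normalizing chromosomes.

```python
from typing import Mapping, Sequence


def sequence_dose(
    label_counts: Mapping[str, int],
    sequence_of_interest: str,
    normalizing_sequences: Sequence[str],
) -> float:
    """Ratio of the label count for the sequence of interest to the
    summed label count of the normalizing sequence(s)."""
    denominator = sum(label_counts[name] for name in normalizing_sequences)
    if denominator == 0:
        raise ValueError("normalizing sequences have no mapped labels")
    return label_counts[sequence_of_interest] / denominator


# Hypothetical counts of sequence labels mapped to each chromosome.
counts = {"chr21": 1500, "chr9": 5000}
dose_chr21 = sequence_dose(counts, "chr21", ["chr9"])
```

Here chromosome 21 is the sequence of interest and chromosome 9 the normalizing sequence, mirroring the example given in the definition.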
[00106] The term "amount of coverage" refers to a modification of raw coverage and often represents the relative amount of sequence labels (sometimes called counts) in a region of a genome such as a bin. An amount of coverage can be obtained by normalizing, adjusting and / or correcting the raw coverage or count for a region of the genome. For example, a normalized amount of coverage for a region can be obtained by dividing the count of sequence labels mapped to the region by the total number of sequence labels mapped to the entire genome. The normalized amount of coverage allows comparison of the coverage of a bin across different samples, which can have different sequencing depths. It differs from the sequence dose in that the latter is typically obtained by dividing by the label count mapped to a subset of the entire genome. The subset is one or more normalizing segments or chromosomes. Coverage amounts, whether normalized or not, can be corrected for variation of the global profile from region to region in the genome, variations of the G-C fraction, outliers in the robust chromosomes, etc.
[00107] The term "Next generation sequencing (NGS)" here refers to sequencing methods that enable massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules. Non-limiting examples of NGS include sequencing-by-synthesis using reversible dye terminators and sequencing-by-ligation.
[00108] The term "parameter" here refers to a numerical value that characterizes a property of a system. Often, a parameter numerically characterizes a set of quantitative data and / or a numerical relationship between sets of quantitative data. For example, a ratio (or a function of a ratio) between the number of sequence labels mapped to a chromosome and the length of the chromosome to which the labels are mapped is a parameter.
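The normalization described in this definition, dividing the count for each region by the total count mapped to the whole genome, can be sketched as follows. The function name and the bin counts are hypothetical; the example only shows why normalized amounts are comparable across samples sequenced to different depths.

```python
from typing import List, Sequence


def normalized_coverage(bin_counts: Sequence[int]) -> List[float]:
    """Divide each bin's label count by the total count over all bins."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("sample has no mapped labels")
    return [count / total for count in bin_counts]


# Two hypothetical samples with the same profile but a tenfold depth difference.
shallow = normalized_coverage([10, 30, 60])
deep = normalized_coverage([100, 300, 600])
```

Both samples yield the same normalized amounts per bin, which is the property that permits comparison of a bin across samples.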
[00109] The terms "threshold value" and "qualified threshold value" here refer to any number that is used as a cutoff to characterize a sample such as a test sample containing a nucleic acid from an organism suspected of having a medical condition. The threshold can be compared with a parameter value to determine whether a sample giving rise to such a parameter value suggests that the organism has the medical condition. In certain embodiments, a qualified threshold value is calculated using a qualification data set and serves as a diagnostic limit for a variation in the number of copies, for example, an aneuploidy, in an organism. If a threshold is exceeded by the results obtained from the methods described here, a subject can be diagnosed with a variation in the number of copies, for example, trisomy 21. The appropriate threshold values for the methods described here can be identified by analyzing normalized values (for example chromosome doses, NCVs or NSVs) calculated for a training set of samples. The threshold values can be identified using qualified (ie, unaffected) samples in a training set that comprises both qualified (ie, unaffected) samples and affected samples. Samples in the training set known to have chromosomal aneuploidies (ie, affected samples) can be used to confirm that the chosen thresholds are useful in differentiating affected from unaffected samples in a test set (see the Examples here). The choice of a threshold depends on the level of confidence that the user wishes to have in making the classification. In some embodiments, the training set used to identify appropriate threshold values comprises at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 700, at least 800, at least 900, at least 1000, at least 2000, at least 3000, at least 4000 or more qualified samples. It may be advantageous to use larger qualified sample sets to improve the diagnostic utility of the threshold values.
[00110] The term "bin" refers to a segment of a sequence or a segment of a genome. In some modalities, bins are contiguous within the genome or chromosome. Each bin can define a sequence of nucleotides in a reference genome. The bin sizes can be 1 kb, 100 kb, 1 Mb, etc., depending on the analysis required by the particular applications and density of the sequence label. In addition to their positions within a reference sequence, bins can have other characteristics such as sample coverage and sequence structure characteristics such as the G-C fraction.
[00111] The term "masking threshold" is used here to refer to an amount against which a value based on the number of sequence labels in a sequence bin is compared, in which a bin having a value exceeding the masking threshold is masked. In some embodiments, the masking threshold can be a percentile rank, an absolute count, a mapping quality score, or other suitable values. In some modalities, a masking threshold can be defined as the percentile rank of a coefficient of variation across multiple unaffected samples. In other embodiments, a masking threshold can be defined as a mapping quality score, for example, a MapQ score, which relates to the reliability of the alignment of sequence readings to a reference genome. Note that a masking threshold value is different from a copy number variation (CNV) threshold value, the latter being a cutoff to characterize a sample containing a nucleic acid from an organism suspected of having a CNV-related medical condition. In some embodiments, a CNV threshold value is defined in relation to a normalized chromosome value (NCV) or a normalized segment value (NSV) described elsewhere here.
[00112] The term "normalized value" here refers to a numerical value that relates the number of sequence labels identified for the sequence (for example chromosome or chromosome segment) of interest to the number of sequence labels identified for a normalizing sequence (eg normalizing chromosome or normalizing chromosome segment). For example, a "normalized value" can be a chromosome dose as described elsewhere here or it can be an NCV or it can be an NSV as described elsewhere.
[00113] The term "reading" refers to a sequence obtained from a portion of a nucleic acid sample. Typically, although not necessarily, a reading represents a short sequence of contiguous base pairs in the sample. The reading can be represented symbolically by the base pair sequence (in A, T, C or G) of the sample portion. It can be stored on a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A reading can be obtained directly from a sequencing device or indirectly from stored sequence information concerning the sample. In some cases, a reading is a DNA sequence of sufficient length (for example, at least about 25 base pairs) that it can be used to identify a larger sequence or region, for example, one that can be aligned and specifically assigned to a chromosome or genomic region or gene.
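One of the options mentioned in this definition, a masking threshold expressed as a percentile rank of the coefficient of variation (CV) across unaffected samples, can be sketched as below. This is a hypothetical illustration: the function names, the data layout (one list of normalized coverage amounts per sample), the definition of percentile rank as the share of bins with a strictly smaller CV, and the threshold of 70 are all assumptions chosen for the example.

```python
import statistics
from bisect import bisect_left
from typing import List, Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")  # bins with no coverage are treated as maximally variable
    return statistics.pstdev(values) / mean


def mask_bins(
    coverage_by_sample: Sequence[Sequence[float]],
    percentile_threshold: float,
) -> List[bool]:
    """Return True for each bin whose CV percentile rank exceeds the threshold."""
    n_bins = len(coverage_by_sample[0])
    cvs = [
        coefficient_of_variation([sample[b] for sample in coverage_by_sample])
        for b in range(n_bins)
    ]
    ranked = sorted(cvs)
    # Percentile rank: percentage of bins with a strictly smaller CV.
    ranks = [100.0 * bisect_left(ranked, cv) / n_bins for cv in cvs]
    return [rank > percentile_threshold for rank in ranks]


# Three hypothetical unaffected samples, four bins each.
unaffected = [
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.1, 1.0, 1.0],
    [1.0, 0.9, 1.05, 2.0],
]
mask = mask_bins(unaffected, percentile_threshold=70.0)
```

In this toy input only the last bin varies strongly between samples, so it is the only bin whose rank exceeds the threshold and is masked.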
[00114] The term "genomic reading" is used in reference to a reading of any segments in the individual's entire genome.
[00115] The term "sequence label" is used interchangeably with the term "mapped sequence label" to refer to a sequence reading that has been specifically assigned, that is, mapped, to a larger sequence, for example, a reference genome, by alignment. The mapped sequence labels are uniquely mapped to a reference genome, that is, they are assigned to a single location in the reference genome. Unless otherwise specified, labels that map to the same sequence in a reference sequence are counted once. Labels can be supplied as data structures or other data assemblies. In certain embodiments, a label contains a reading sequence and associated information for this reading such as the location of the sequence in the genome, for example, the position on a chromosome. In certain embodiments, the location is specified for a positive strand orientation. A label can be defined to allow a limited amount of mismatches in alignment with a reference genome. In some modalities, labels that can be mapped to more than one location in a reference genome, that is, labels that do not map uniquely, may not be included in the analysis.
[00116] The term "non-redundant sequence label" refers to sequence labels that do not map to the same site, which are counted for the purpose of determining normalized chromosomal values (NCVs) in some modalities. Sometimes multiple sequence readings are aligned at the same location in a reference genome, producing redundant or duplicate sequence labels. In some embodiments, duplicate sequence labels that map to the same position are omitted or counted as one “non-redundant sequence label” for the purpose of determining NCVs. In some embodiments, non-redundant sequence labels aligned with non-excluded sites are counted to produce “non-excluded site counts” (NES counts) to determine NCVs.
[00117] The term "site" refers to a unique position (ie, chromosomal ID, chromosomal position and orientation) in a reference genome. In some embodiments, a site can provide a location for a residue, a sequence label, or a segment in a sequence.
[00118] "Excluded sites" are sites found in regions of a reference genome that have been excluded for the purpose of counting sequence labels. In some embodiments, excluded sites are found in regions of chromosomes that contain repetitive sequences, for example, centromeres and telomeres, and regions of chromosomes that are common to more than one chromosome, for example, regions present on the Y chromosome that are also present on the X chromosome.
[00119] "Non-excluded sites" (NESs) are sites that are not excluded in a reference genome for the purpose of counting sequence labels.
[00120] "Non-excluded site counts" (NES counts) are the numbers of sequence labels that are mapped to NESs in a reference genome. In some embodiments, NES counts are the numbers of non-redundant sequence labels mapped to NESs. In some embodiments, coverage and related parameters such as normalized coverage amounts, coverage amounts with the global profile removed and chromosome dose are based on NES counts. In one example, a chromosome dose is calculated as the ratio of the NES count for a chromosome of interest to the NES count for a normalizing chromosome.
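The counting rules in the preceding definitions (a site as chromosome, position and orientation; duplicates counted once; excluded regions skipped) can be combined in a short sketch. Everything named here is hypothetical: the function `nes_counts`, the tuple layout of a label, the half-open interval convention for excluded regions, and the example coordinates.

```python
from typing import Dict, Iterable, List, Mapping, Tuple

Site = Tuple[str, int, str]  # (chromosome ID, position, orientation)


def nes_counts(
    labels: Iterable[Site],
    excluded_regions: Mapping[str, List[Tuple[int, int]]],
) -> Dict[str, int]:
    """Count non-redundant labels mapped to non-excluded sites, per chromosome.

    Excluded regions are half-open intervals [start, end); this convention
    is an assumption of the sketch.
    """
    seen = set()
    counts: Dict[str, int] = {}
    for chrom, pos, strand in labels:
        site = (chrom, pos, strand)
        if site in seen:
            continue  # duplicate labels at the same site are counted once
        seen.add(site)
        regions = excluded_regions.get(chrom, [])
        if any(start <= pos < end for start, end in regions):
            continue  # label falls in an excluded site
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts


labels = [
    ("chr21", 100, "+"), ("chr21", 100, "+"),  # duplicate pair
    ("chr21", 100, "-"),                       # same position, other orientation
    ("chr21", 5000, "+"),                      # inside an excluded region
    ("chr9", 10, "+"), ("chr9", 20, "+"), ("chr9", 30, "+"), ("chr9", 40, "-"),
]
excluded = {"chr21": [(4000, 6000)]}
counts = nes_counts(labels, excluded)
chromosome_dose = counts["chr21"] / counts["chr9"]
```

The final line follows the example in paragraph [00120]: the dose is the NES count for the chromosome of interest divided by the NES count for the normalizing chromosome.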
[00121] The normalized chromosome value (NCV) relates the coverage of a test sample to the coverages of a set of training / qualified samples. In some modalities, the NCV is based on the chromosome dose. In some embodiments, the NCV relates to the difference between the chromosome dose of a chromosome of interest in a test sample and the mean of the corresponding chromosome dose in a set of qualified samples, and can be calculated as:
$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualified samples and $x_{ij}$ is the observed j-th chromosome ratio (dose) for test sample i.
[00122] In some embodiments, NCV can be calculated "on the fly" by relating the chromosome doses of a chromosome of interest in a test sample to the average of the corresponding chromosome dose in multiplexed samples sequenced in the same flow cells as : where Mj is the estimated average for the chromosome dose in a set of multiplexed samples sequenced in the same flow cell; is c σ) Standard view for the chromosome dose in one or more sets of multiplexed samples sequenced in one or more flow cells and Xij is the observed chromosome dose for test sample i.
In this embodiment, test sample i is one of the multiplexed samples sequenced in the same flow cell from which Mj is determined. [00123] For example, for chromosome 21 of interest in test sample A, which is sequenced as one of 64 samples multiplexed in a flow cell, the NCV for chromosome 21 in test sample A is calculated as the dose of chromosome 21 in the sample minus the average dose for chromosome 21 determined in the 64 multiplexed samples, divided by the dose standard deviation for chromosome 21 determined for the 64 samples multiplexed in flow cell 1 or additional flow cells.
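The chromosome 21 example above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the dictionary of doses are hypothetical, the centre is the arithmetic mean as worded in the example, and the sample standard deviation (n - 1 denominator) is used because the text does not say which estimator applies.

```python
from statistics import mean, stdev


def ncv_on_the_fly(doses, sample_id):
    """NCV of one multiplexed sample relative to its own flow cell.

    `doses` maps sample identifiers to the dose of the chromosome of
    interest. The test sample is itself one of the multiplexed samples,
    so it contributes to the centre and the spread.
    """
    values = list(doses.values())
    centre = mean(values)      # average dose over the multiplexed samples
    spread = stdev(values)     # sample standard deviation (assumed estimator)
    return (doses[sample_id] - centre) / spread
```

With 64 multiplexed samples, `doses` would hold 64 entries and the NCV for test sample A would be `ncv_on_the_fly(doses, "A")`.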
[00124] As used herein, the terms "aligned", "alignment" or "align" refer to the process of comparing a reading or label to a reference sequence and thereby determining whether the reference sequence contains the reading sequence . If the reference sequence contains the reading, the reading can be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply tells you whether a reading is a member of a particular reference sequence or not (that is, whether the reading is present or absent in the reference sequence). For example, aligning a reading with the reference sequence for human chromosome 13 will tell you whether the reading is present in the reference sequence for chromosome 13. A tool that provides this information can be called a member tester. In some cases, an alignment additionally indicates a location in the reference sequence where the reading or label maps. For example, if the reference sequence is the entire human genomic sequence, an alignment may indicate that a reading is present on chromosome 13 and may also indicate that the reading is on a particular strand and / or site on chromosome 13.
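The two outcomes described above, membership only versus membership plus location, can be illustrated with a toy exact-match example. It is a sketch, not an aligner: the function names are hypothetical, only the forward strand is searched, and no mismatches are tolerated, whereas practical alignment tools handle both strands and imperfect matches.

```python
def is_member(reading, reference):
    """Membership test: is the reading present anywhere in the reference?"""
    return reading in reference


def locate(reading, reference):
    """Return every 0-based forward-strand position where the reading occurs."""
    hits = []
    start = reference.find(reading)
    while start != -1:
        hits.append(start)
        start = reference.find(reading, start + 1)
    return hits
```

A reading that occurs at more than one position yields several hits, which is the situation that produces ambiguous mappings in repetitive regions.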
[00125] Aligned readings or labels are one or more sequences that are identified as a match, in terms of the order of their nucleic acid molecules, to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, since it would be impossible to align readings manually within a reasonable period of time for implementing the methods described here. An example of a sequence alignment algorithm is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. Alternatively, a Bloom filter or similar set membership tester can be used to align readings to reference genomes. See US Patent Application No. 61/552,374, filed October 27, 2011, which is incorporated herein by reference in its entirety. The match of a sequence reading in an alignment can be a 100% sequence match or less than 100% (a non-perfect match).
[00126] The term "mapping" used here refers to specifically designating a sequence reading for a larger sequence, for example, a reference genome, by alignment.
[00127] As used herein, the terms "reference genome" or "reference sequence" refer to any particular known genomic sequence, whether partial or complete, of any organism or virus that can be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
[00128] In various embodiments, the reference sequence is significantly longer than the readings that are aligned to it. For example, it can be at least about 100 times longer, or at least about 1000 times longer, or at least about 10,000 times longer, or at least about 10^5 times longer, or at least about 10^6 times longer, or at least about 10^7 times longer.
[00129] In one example, the reference sequence is that of a full-length human genome. Such sequences can be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome such as chromosome 13. In some embodiments, a reference Y chromosome is the Y chromosome sequence from human genome version hg19. Such sequences can be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, subchromosomal regions (such as strands), etc., of any species.
[00130] In several modalities, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence can be taken from a particular individual.
[00131] The term "clinically relevant sequence" here refers to a nucleic acid sequence that is known or suspected of being associated or implicated with a genetic or disease condition. Determining the absence or presence of a clinically relevant sequence can be useful in determining a diagnosis or confirming a diagnosis of a medical condition or providing a prognosis for the development of a disease.
[00132] The term "derived", when used in the context of a nucleic acid or a mixture of nucleic acids, here refers to the means by which the nucleic acid(s) are obtained from the source from which they originate. For example, in one embodiment, a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids, for example, cfDNA, were naturally released by cells through naturally occurring processes such as necrosis or apoptosis. In another embodiment, a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids were extracted from two different types of cells from a subject.
[00133] The term "based on" when used in the context of obtaining a specific quantitative value, here refers to the use of another quantity as an input to calculate the specific quantitative value as an output.
[00134] The term "patient sample" here refers to a biological sample obtained from a patient, that is, a recipient of medical attention, care or treatment. The patient sample can be any of the samples described here. In certain embodiments, the patient sample is obtained by non-invasive procedures, for example, a peripheral blood sample or a stool sample. The methods described here need not be limited to humans. Thus, various veterinary applications are considered in which case the patient sample may be a sample from a non-human mammal (for example, a feline, a porcine, an equine, a bovine and the like).
[00135] The term "mixed sample" here refers to a sample containing a mixture of nucleic acids, which are derived from different genomes.
[00136] The term "maternal sample" here refers to a biological sample obtained from a pregnant patient, for example, a woman.
[00137] The term "biological fluid" here refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, lavage fluid, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like. As used herein, the terms "blood", "plasma", and "serum" expressly include fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly includes a processed fraction or portion derived from the biopsy, swab, smear, etc.
[00138] The terms "maternal nucleic acids" and "fetal nucleic acids" here refer to the nucleic acids of a pregnant female patient and the nucleic acids of the fetus being carried by the pregnant female, respectively.
[00139] As used herein, the term "corresponding to" sometimes refers to a nucleic acid sequence, for example, a gene or a chromosome, that is present in the genome of different subjects and that does not necessarily have the same sequence in all genomes, but serves to provide the identity, rather than the genetic information, of a sequence of interest, for example, a gene or chromosome.
[00140] As used herein, the term "fetal fraction" refers to the fraction of fetal nucleic acids present in a sample comprising fetal and maternal nucleic acid. The fetal fraction is often used to characterize cfDNA in a mother's blood.
[00141] As used herein the term "chromosome" refers to the gene carrier that carries the inheritance of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genomic chromosome numbering system is used here.
[00142] As used herein, the term "polynucleotide length" refers to the absolute number of nucleotides in a sequence or in a region of a reference genome. The term "chromosome length" refers to the known length of the chromosome given in base pairs, for example, as provided in the NCBI36/hg18 assembly of the human chromosome found at genome.ucsc.edu/cgi-bin/hgTracks?hgsid=167155613&chromInfoPage= on the internet.
[00143] The term "subject" here refers to a human subject as well as a non-human subject such as a mammal, an invertebrate, a vertebrate, a fungus, a yeast, a bacterium and a virus. Although the examples here refer to humans and the language is primarily aimed at human interests, the concepts described here are applicable to the genomes of any plant or animal and are useful in the fields of veterinary medicine, animal science, research laboratories and the like.
[00144] The term "condition" here refers to "medical condition" as a broad term that includes all diseases and disorders, but can include injuries and normal health conditions, such as pregnancy, that would affect a person's health, health care benefits or have implications for medical treatments.
[00145] The term "complete" when used in reference to a chromosomal aneuploidy here refers to a gain or loss of an entire chromosome.
[00146] The term "partial" when used in reference to a chromosomal aneuploidy here refers to a gain or loss of a portion, that is, a segment, of a chromosome.
[00147] The term "mosaic" here refers to the presence of two populations of cells with different karyotypes in an individual that grew from a single fertilized egg. Mosaicism can result from a mutation during development that is propagated to only a subset of adult cells.
[00148] The term "non-mosaic" here refers to an organism, for example, a human fetus, composed of cells of one karyotype.
[00149] The term "sensitivity" as used here refers to the probability that a test result will be positive when the condition of interest is present. It can be calculated as the number of true positives divided by the sum of true positives and false negatives.
[00150] The term "specificity" as used here refers to the probability that a test result will be negative when the condition of interest is absent. It can be calculated as the number of true negatives divided by the sum of true negatives and false positives.
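The two definitions above translate directly into code. The following is a small sketch with hypothetical function names; the counts passed in would come from comparing test results against a confirmed reference standard.

```python
def sensitivity(true_positives, false_negatives):
    """Probability of a positive result when the condition is present."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Probability of a negative result when the condition is absent."""
    return true_negatives / (true_negatives + false_positives)
```

For instance, 90 detected cases out of 100 affected samples gives a sensitivity of 0.9, and 950 correct negatives out of 1000 unaffected samples gives a specificity of 0.95.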
[00151] The term "enrich" here refers to the process of amplifying the polymorphic target nucleic acids contained in a portion of a maternal sample and combining the amplified product with the rest of the maternal sample from which the portion was removed. For example, the rest of the maternal sample can be the original maternal sample.
[00152] The term "original maternal sample" here refers to an unenriched biological sample obtained from a pregnant patient, for example, a woman, which serves as the source from which a portion is removed to amplify polymorphic target nucleic acids.
[00153] The "original sample" can be any sample obtained from a pregnant patient and its fractions processed, for example, a sample of purified cfDNA extracted from a sample of maternal plasma.
[00154] The term "primer" as used herein refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions inductive to synthesis of an extension product (for example, the conditions include nucleotides, an inducing agent such as DNA polymerase, and a suitable temperature and pH). The primer is preferably single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded. If double-stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer, use of the method, and the parameters used for primer design.
Introduction and Context
[00155] CNV in the human genome significantly influences human diversity and predisposition to disease (Redon et al., Nature 23: 444-454 [2006]; Shaikh et al., Genome Res 19: 1682-1690 [2009]). Such diseases include, but are not limited to, cancer, infectious and autoimmune diseases, diseases of the nervous system, metabolic and/or cardiovascular diseases, and the like.
[00156] CNVs are known to contribute to genetic disease through different mechanisms, resulting in either gene dosage imbalance or gene disruption in most cases. In addition to their direct correlation with genetic disorders, CNVs are known to mediate phenotypic changes that can be harmful. Recently, several studies have reported an increased burden of rare or de novo CNVs in complex disorders such as autism, ADHD, and schizophrenia compared with normal controls, highlighting the potential pathogenicity of rare or unique CNVs (Sebat et al., 316: 445-449 [2007]; Walsh et al., Science 320: 539-543 [2008]). CNVs arise from genomic rearrangements, primarily owing to deletion, duplication, insertion, and unbalanced translocation events.
[00157] It has been shown that cfDNA fragments of fetal origin are shorter, on average, than those of maternal origin. NIPT (non-invasive prenatal testing) based on NGS data has been successfully implemented. Current methodologies involve sequencing maternal samples using short readings (25 base pairs to 36 base pairs), aligning them to the genome, computing and normalizing subchromosomal coverage, and finally evaluating the over-representation of target chromosomes (13/18/21/X/Y) compared with the expected normalized coverage associated with a normal diploid genome. Thus, traditional NIPT testing and analysis relies on counts or coverage to assess the likelihood of fetal aneuploidy.
[00158] Since maternal plasma samples represent a mixture of maternal and fetal cfDNA, the success of any given NIPT method depends on its sensitivity to detect copy number changes in samples with a low fetal fraction. For count-based methods, sensitivity is determined by (a) sequencing depth and (b) the ability of data normalization to reduce technical variation. This description provides an analytical methodology for NIPT and other applications that derives fragment size information, for example, from paired-end readings, and uses this information in an analysis pipeline. The improved analytical sensitivity makes it possible to apply NIPT methods at reduced coverage (for example, reduced sequencing depth), which enables use of the technology for lower cost testing of average risk pregnancies.
[00159] Methods, apparatus, and systems are described herein for determining copy number and copy number variations (CNV) of different sequences of interest in a test sample that comprises a mixture of nucleic acids derived from two or more different genomes and that are known or suspected to differ in the amount of one or more sequences of interest. The copy number variations determined by the methods and apparatus described here include gains or losses of entire chromosomes, changes involving very large chromosome segments that are microscopically visible, and an abundance of submicroscopic copy number variation of DNA segments ranging from a single nucleotide, to kilobases (kb), to megabases (Mb) in size.
[00160] In some embodiments, methods are provided for determining copy number variation (CNV) of fetuses using maternal samples containing maternal and fetal cell-free DNA. Some implementations use the cfDNA fragment length (or fragment size) to improve sensitivity and specificity for the detection of fetal aneuploidy from cfDNA in maternal plasma. Some embodiments are implemented with PCR-free library preparation coupled with paired-end DNA sequencing. In some embodiments, both fragment size and coverage are used to enhance the detection of fetal aneuploidy. In some embodiments, the methods involve combining independent counting of shorter fragments with the relative fraction of shorter fragments in bins across the genome.
[00161] Some embodiments described here provide methods for improving the sensitivity and/or specificity of sequence data analysis by removing within-sample GC-content bias. In some embodiments, the removal of within-sample GC-content bias is based on sequence data corrected for systematic variation common across unaffected training samples.
[00162] Some disclosed embodiments provide methods for deriving parameters with a high signal-to-noise ratio from cell-free nucleic acid fragments, for determining various genetic conditions related to copy number and CNV with improved sensitivity, selectivity, and/or efficiency relative to conventional methods. The parameters include, but are not limited to, coverage weighted by fragment size, the fraction or ratio of fragments within a defined size range, fragment methylation level, statistics obtained from coverage, fetal fraction estimates obtained from coverage information, etc. The disclosed process has been found to be particularly effective at improving the signal in samples having relatively low fractions of DNA from a genome under consideration (e.g., the genome of a fetus). An example of such a sample is a maternal blood sample from an individual pregnant with fraternal twins, triplets, etc., where the process assesses copy number variation in the genome of one of the fetuses.
[00163] In some embodiments, high analytical sensitivity and specificity can be achieved with a simple library preparation using very low cfDNA input that does not require PCR amplification. The PCR-free method simplifies the workflow, improves turnaround time, and eliminates biases that are inherent to PCR methods. In some embodiments, the detection of fetal aneuploidy from maternal plasma can be made more robust and efficient than conventional methods, requiring fewer unique cfDNA fragments. In combination, improved analytical sensitivity and specificity are achieved with very fast turnaround time at a significantly lower number of cfDNA fragments. This potentially allows NIPT to be performed at significantly lower cost, facilitating application in the general obstetric population.
[00164] In various implementations, PCR-free library preparation is possible with the disclosed methods. Some implementations eliminate biases inherent to PCR methods, reduce assay complexity, reduce the required sequencing depth (2.5X lower), provide faster turnaround time (for example, one-day turnaround), enable fetal fraction (FF) measurement within the process, and facilitate discrimination between maternal and fetal/placental cfDNA using fragment size information.
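As a rough illustration of the last point, the sketch below computes, for each genomic bin, both the count of shorter fragments and the fraction of fragments that are short. The tuple layout, the 100 kb bin size, and the 150 bp cutoff are assumptions chosen for the example; the paragraph above does not fix any of these values.

```python
from collections import defaultdict


def short_fragment_stats(fragments, bin_size=100_000, short_max=150):
    """Per-bin count and fraction of short fragments.

    `fragments` is an iterable of (chrom, start, length) tuples, with
    the length taken, for example, from paired-end alignments.
    Returns {(chrom, bin_index): (short_count, short_fraction)}.
    """
    total = defaultdict(int)
    short = defaultdict(int)
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        total[key] += 1
        if length < short_max:      # illustrative cutoff for "shorter" fragments
            short[key] += 1
    return {k: (short[k], short[k] / total[k]) for k in total}
```

The two numbers per bin correspond to the two signals named in the text: an independent count of shorter fragments and the relative fraction of shorter fragments.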
CNV Assessment
Methods for determining CNV
[00165] Using the sequence coverage value, fragment size parameters, and / or methylation levels provided by the methods described here, one can determine the various genetic conditions related to the number of copies and CNV of sequences, chromosomes or chromosome segments with improved sensitivity, selectivity, and / or efficiency compared to using the sequence coverage values obtained by conventional methods. For example, in some embodiments, masked reference sequences are used to determine the presence or absence of any two or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acid molecules. The exemplary methods provided below align the readings with the reference sequences (including reference genomes). Alignment can be performed in an unmasked or masked reference sequence, thereby producing sequence labels mapped to the reference sequence. In some embodiments, only sequence labels that fall into the unmasked segments of the reference sequence are considered to determine the variation in the number of copies.
[00166] In some embodiments, the evaluation of a nucleic acid sample for CNV involves characterizing the status of a chromosomal or segmental aneuploidy by one of three types of calls: "normal" or "unaffected", "affected", and "no call". Thresholds for calling normal and affected are typically set. A parameter related to aneuploidy or other copy number variation is measured in a sample, and the measured value is compared with the thresholds. For duplication-type aneuploidies, a call of affected is made if a chromosome or segment dose (or other measured value of sequence content) is above a defined threshold set for affected samples. For such aneuploidies, a call of normal is made if the chromosome or segment dose is below a threshold set for normal samples. By contrast, for deletion-type aneuploidies, a call of affected is made if a chromosome or segment dose is below a defined threshold for affected samples, and a call of normal is made if the chromosome or segment dose is above a threshold set for normal samples. For example, in the presence of trisomy, the "normal" call is determined by the value of a parameter, for example, a test chromosome dose, that is below a user-defined threshold of reliability, and the "affected" call is determined by a parameter, for example, a test chromosome dose, that is above a user-defined threshold of reliability. A "no call" result is determined by a parameter, for example, a test chromosome dose, that lies between the thresholds for making a "normal" or an "affected" call. The term "no call" is used interchangeably with "unclassified".
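The three-way decision described above can be sketched as a small function. The function name, the string labels, and the numeric thresholds in the usage note are illustrative assumptions; the text only states that the thresholds are user-defined.

```python
def call_cnv(value, normal_threshold, affected_threshold, kind="duplication"):
    """Classify a measured dose (or NCV) as normal, affected, or unclassified.

    For duplications the affected threshold lies above the normal one;
    for deletions the comparison directions are reversed.
    """
    if kind == "duplication":
        if value > affected_threshold:
            return "affected"
        if value < normal_threshold:
            return "normal"
    elif kind == "deletion":
        if value < affected_threshold:
            return "affected"
        if value > normal_threshold:
            return "normal"
    else:
        raise ValueError("kind must be 'duplication' or 'deletion'")
    return "no-call"  # the value lies between the two thresholds
```

With hypothetical thresholds of 2.5 (normal) and 4.0 (affected), `call_cnv(3.0, 2.5, 4.0)` falls between them and is left unclassified.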
[00167] The parameters that can be used to determine CNV include, but are not limited to, coverage, fragment-size-biased or fragment-size-weighted coverage, the fraction or ratio of fragments in a defined size range, and fragment methylation level. As discussed here, coverage is obtained from counts of readings aligned to a region of a reference genome and optionally normalized to produce sequence label counts. In some embodiments, the sequence label counts can be weighted by fragment size.
[00168] In some embodiments, a fragment size parameter is biased toward fragment sizes characteristic of one of the genomes. A fragment size parameter is a parameter that relates to the size of a fragment. A parameter is biased toward a fragment size when: 1) the parameter is favorably weighted for the fragment size, for example, a count weighted more heavily for that size than for other sizes; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size, for example, a ratio obtained from a count weighted more heavily for that size. A size is characteristic of a genome when the genome has an enriched or higher concentration of nucleic acid of that size relative to another genome or another portion of the same genome.
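A count that is weighted toward a characteristic fragment size can be sketched as below. The weighting function, its 150 bp cutoff, and the weight of 0.25 for longer fragments are purely hypothetical choices for illustration; the text does not prescribe a particular weighting scheme.

```python
def short_biased_weight(length, cutoff=150):
    """Hypothetical weight: full weight below the cutoff, reduced weight otherwise."""
    return 1.0 if length < cutoff else 0.25


def size_weighted_count(fragment_lengths, weight=short_biased_weight):
    """Sum of per-fragment weights, giving a count biased toward the favored sizes."""
    return sum(weight(n) for n in fragment_lengths)
```

For fragment lengths of 120, 140, 180, and 200 bp, the unweighted count is 4, while the weighted count is 1 + 1 + 0.25 + 0.25 = 2.5, so shorter fragments dominate the parameter.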
[00169] In some embodiments, the method for determining the presence or absence of any complete fetal chromosomal aneuploidies in a maternal test sample comprises (a) obtaining sequence information for fetal and maternal nucleic acids in the maternal test sample; (b) using the sequence information and the method described above to identify a number of sequence labels, a sequence coverage quantity, a fragment size parameter, or another parameter for each of the chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence labels or another parameter for one or more normalizing chromosome sequences; (c) using the number of sequence labels or the other parameter identified for each of the chromosomes of interest and the number of sequence labels or the other parameter identified for each of the normalizing chromosomes to calculate a single chromosome dose for each of the chromosomes of interest; and (d) comparing each chromosome dose to a threshold value, thereby determining the presence or absence of any complete fetal chromosomal aneuploidies in the maternal test sample.
[00170] In some embodiments, step (a) described above can comprise sequencing at least a portion of the nucleic acid molecules of a test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample. In some embodiments, step (c) comprises calculating a single chromosome dose for each of the chromosomes of interest as the ratio of the number of sequence labels or the other parameter identified for each of the chromosomes of interest to the number of sequence labels or the other parameter identified for the normalizing chromosome sequence(s). In some other embodiments, the chromosome dose is based on processed sequence coverage quantities derived from the number of sequence labels or another parameter. In some embodiments, only unique, non-redundant sequence labels are used to calculate the processed sequence coverage quantities or another parameter. In some embodiments, the processed sequence coverage quantity is a sequence label density ratio, which is the number of sequence labels normalized by sequence length. In some embodiments, the processed sequence coverage quantity or the other parameter is a normalized sequence label count or another normalized parameter, which is the number of sequence labels or the other parameter of a sequence of interest divided by that of all or a substantial portion of the genome. In some embodiments, the processed sequence coverage quantity or the other parameter, such as a fragment size parameter, is adjusted according to a global profile of the sequence of interest. In some embodiments, the processed sequence coverage quantity or the other parameter is adjusted according to the within-sample correlation between GC content and sequence coverage for the sample being tested. In some embodiments, the processed sequence coverage quantity or the other parameter results from combinations of these processes, which are described elsewhere herein.
[00171] In some embodiments, a chromosome dose is calculated as the ratio of the processed sequence coverage or the other parameter for each of the chromosomes of interest to that for the normalizing chromosome sequence(s).
[00172] In any of the above embodiments, the complete chromosomal aneuploidies are selected from complete chromosomal trisomies, complete chromosomal monosomies, and complete chromosomal polysomies. The complete chromosomal aneuploidies are selected from complete aneuploidies of any of chromosomes 1-22, X, and Y. For example, said different complete fetal chromosomal aneuploidies are selected from trisomy 2, trisomy 8, trisomy 9, trisomy 21, trisomy 13, trisomy 16, trisomy 18, trisomy 22, 47,XXX, 47,XYY, and monosomy X.
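One of the processed quantities named above, the label density (label count normalized by sequence length), can be used to form a dose as in the following sketch. The function names and the idea of taking the ratio of two densities are illustrative assumptions; the text describes the density and the dose separately and does not give this exact combination.

```python
def label_density(label_count, sequence_length_bp):
    """Number of sequence labels per base pair of the sequence."""
    if sequence_length_bp <= 0:
        raise ValueError("sequence length must be positive")
    return label_count / sequence_length_bp


def density_ratio_dose(count_interest, length_interest, count_norm, length_norm):
    """Dose formed as the density of the sequence of interest divided by
    the density of the normalizing sequence (assumed combination)."""
    return (label_density(count_interest, length_interest)
            / label_density(count_norm, length_norm))
```

Normalizing by length keeps a long normalizing sequence from dominating the ratio simply because it attracts more labels.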
[00173] In any of the above modalities, steps (a) - (d) are repeated for test samples from different maternal subjects and the method comprises determining the presence or absence of any two or more different complete fetal chromosomal aneuploidies in each one of the test samples.
[00174] In any of the above embodiments, the method can further comprise calculating a normalized chromosome value (NCV), in which the NCV relates the chromosome dose to the mean of the corresponding chromosome dose in a set of qualified samples as:
$$\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualified samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
[00175] In some embodiments, the NCV can be calculated "on the fly" by relating the chromosome dose of a chromosome of interest in a test sample to the median of the corresponding chromosome dose in multiplexed samples sequenced on the same flow cells as:
$$\mathrm{NCV}_{ij} = \frac{x_{ij} - M_j}{\hat{\sigma}_j}$$
where $M_j$ is the estimated median for the j-th chromosome dose in a set of multiplexed samples sequenced on the same flow cell; $\hat{\sigma}_j$ is the standard deviation for the j-th chromosome dose in one or more sets of multiplexed samples sequenced on one or more flow cells; and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
[00176] In this embodiment, test sample i is one of the multiplexed samples sequenced on the same flow cell from which $M_j$ is determined.
[00177] In some embodiments, a method is provided to determine the presence or absence of different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method involves procedures analogous to the method for detecting complete aneuploidy as outlined above. However, instead of analyzing a complete chromosome, a segment of a chromosome is analyzed. See US Patent Application Publication No. 2013/0029852, which is incorporated by reference.
[00178] Figure 1 shows a method for determining the presence of copy number variation according to some embodiments. The process 100 illustrated in Figure 1 uses sequence label coverage based on the number of sequence labels (i.e., the sequence label count) to determine CNV. However, similar to the description above for calculating an NCV, other variables or parameters, such as size, size ratio, and methylation level, can be used instead of coverage. In some implementations, two or more variables are combined to determine a CNV. In addition, coverage and other parameters can be weighted based on the size of the fragments from which the labels are derived. For ease of reading, only coverage is mentioned in the process 100 illustrated in Figure 1, but it should be noted that other parameters, such as size, size ratio, methylation level, size-weighted count, etc., can be used in place of coverage.
[00179] In operations 130 and 135, qualified sequence label coverages (or values of another parameter) and test sequence label coverages (or values of another parameter) are determined. The present description provides processes for determining coverage quantities that provide improved sensitivity and selectivity over conventional methods. Operations 130 and 135 are marked by asterisks and emphasized by wavy line boxes to indicate that these operations contribute to improvement over the prior art. In some embodiments, the sequence label coverage quantities are normalized, adjusted, trimmed, and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are described elsewhere herein.
[00180] From an overview perspective, the method makes use of normalizing sequences of qualified training samples in determining the CNV of test samples. In some embodiments, the qualified training samples are unaffected and have a normal copy number. Normalizing sequences provide a mechanism to normalize measurements for intra-run and inter-run variability. Normalizing sequences are identified using sequence information from a set of qualified samples obtained from subjects known to comprise cells having a normal copy number for any sequence of interest, for example, a chromosome or segment thereof. The determination of normalizing sequences is outlined in steps 110, 120, 130, 145, and 146 of the embodiment of the method shown in Figure 1. In some embodiments, normalizing sequences are used to calculate the sequence dose for the test sequences. See step 150. In some embodiments, normalizing sequences are also used to calculate a threshold against which the sequence dose of the test sequences is compared. See step 150. The sequence information obtained from the normalizing sequence and the test sequence is used to determine statistically significant identification of chromosomal aneuploidies in the test samples (step 160).
[00181] Returning to the details of the method to determine the presence of variation in the number of copies according to some modalities, Figure 1 provides a flow diagram 100 of a modality to determine a CNV of a sequence of interest, for example, a chromosome or segment thereof, in a biological sample. In some embodiments, a biological sample is obtained from a subject and comprises a mixture of the nucleic acids donated by the different genomes. Different genomes can be donated to the sample by the two individuals, for example, different genomes are donated by the fetus and the mother who carries the fetus. Also, different genomes can be contributed to the sample by three or more individuals, for example, different genomes are contributed by two or more fetuses and by the mother who carries the fetuses. Alternatively, genomes are contributed to the sample by aneuploid cancer cells and normal euploid cells from the same subject, for example, a plasma sample from a cancer patient.
[00182] In addition to analyzing a patient's test sample, one or more normalizing chromosomes or one or more normalizing chromosome segments are selected for each possible chromosome of interest. Normalizing chromosomes or segments are identified asynchronously from the normal testing of patient samples, which can occur in a clinical setting. In other words, normalizing chromosomes or segments are identified before patient samples are tested. Associations between normalizing chromosomes or segments and chromosomes or segments of interest are stored for use during testing. As explained below, such an association is typically maintained over periods of time that span the testing of many samples. The following discussion concerns embodiments for selecting normalizing chromosomes or chromosome segments for individual chromosomes or segments of interest.
[00183] A set of qualified samples is obtained to identify qualified normalizing sequences and to provide variation values for use in determining the statistically significant identification of CNV in the test samples. In step 110, a plurality of qualified biological samples are obtained from a plurality of known subjects comprising cells having a normal copy number for any sequence of interest. In one embodiment, qualified samples are obtained from pregnant mothers with a fetus that has been confirmed using cytogenetic means to have a normal chromosome copy number. Qualified biological samples can be a biological fluid, for example, plasma or any suitable sample as described below. In some embodiments, a qualified sample contains a mixture of nucleic acid molecules, for example, cfDNA molecules. In some embodiments, the qualified sample is a sample of maternal plasma that contains a mixture of fetal and maternal cfDNA molecules. Sequence information for chromosomes and / or normalizing segments thereof is obtained by sequencing at least a portion of the nucleic acids, for example, fetal and maternal nucleic acids, using any known sequencing method.
Preferably, any of the Next Generation Sequencing (NGS) methods described elsewhere herein is used to sequence fetal and maternal nucleic acids as single or clonally amplified molecules. In several embodiments, qualified samples are processed as described below before and during sequencing. They can be processed using devices, systems and kits as described herein.
[00184] In step 120, at least a portion of each of all qualified nucleic acids contained in the qualified samples is sequenced to generate millions of sequence readings, for example, 36 base pair readings, which are aligned with a reference genome, for example, hg18. In some embodiments, the sequence readings comprise about 20 base pairs, about 25 base pairs, about 30 base pairs, about 35 base pairs, about 40 base pairs, about 45 base pairs, about 50 base pairs, about 55 base pairs, about 60 base pairs, about 65 base pairs, about 70 base pairs, about 75 base pairs, about 80 base pairs, about 85 base pairs, about 90 base pairs, about 95 base pairs, about 100 base pairs, about 110 base pairs, about 120 base pairs, about 130 base pairs, about 140 base pairs, about 150 base pairs, about 200 base pairs, about 250 base pairs, about 300 base pairs, about 350 base pairs, about 400 base pairs, about 450 base pairs or about 500 base pairs. It is expected that technological advances will enable single-end readings of more than 500 base pairs, enabling readings of more than about 1000 base pairs when paired-end readings are generated. In one embodiment, the mapped sequence readings comprise 36 base pairs. In another embodiment, the mapped sequence readings comprise 25 base pairs.
[00185] Sequence readings are aligned with a reference genome and readings that are uniquely mapped to the reference genome are known as sequence labels. Sequence labels that fall on the masked segments of a masked reference sequence are not counted for the CNV analysis.
[00186] In one embodiment, at least about 3 x 10^6 qualified sequence labels, at least about 5 x 10^6 qualified sequence labels, at least about 8 x 10^6 qualified sequence labels, at least about 10 x 10^6 qualified sequence labels, at least about 15 x 10^6 qualified sequence labels, at least about 20 x 10^6 qualified sequence labels, at least about 30 x 10^6 qualified sequence labels, at least about 40 x 10^6 qualified sequence labels or at least about 50 x 10^6 qualified sequence labels comprising readings between 20 and 40 base pairs are obtained from readings that map uniquely to a reference genome.
[00187] In step 130, all labels obtained from the sequencing of nucleic acids in the qualified samples are counted to obtain a qualified sequence label coverage. Similarly, in operation 135, all labels obtained from a test sample are counted to obtain a test sequence label coverage. The present description provides processes for determining the amounts of coverage that provide improved sensitivity and selectivity over conventional methods. Operations 130 and 135 are marked by asterisks and emphasized by wavy line boxes to indicate that these operations contribute to improvement over the prior art. In some embodiments, the amounts of sequence label coverage are normalized, adjusted, trimmed and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are described elsewhere herein.
[00188] Once all qualified sequence labels have been mapped and counted in each of the qualified samples, the sequence label coverage for a sequence of interest, for example, a clinically relevant sequence, in the qualified samples is determined, as are the sequence label coverages for the additional sequences from which normalizing sequences are subsequently identified.
[00189] In some embodiments, the sequence of interest is a chromosome that is associated with a complete chromosomal aneuploidy, for example, chromosome 21, and the qualified normalizing sequence is a complete chromosome that is not associated with a chromosomal aneuploidy and whose variation in sequence label coverage is close to that of the sequence (i.e., chromosome) of interest, for example, chromosome 21. The selected normalizing chromosome(s) may be the one or the group that best approximates the variation in sequence label coverage of the sequence of interest. Any one or more of chromosomes 1-22, X and Y can be a sequence of interest, and one or more chromosomes can be identified as the normalizing sequence for each of any of chromosomes 1-22, X and Y in the qualified samples. The normalizing chromosome can be an individual chromosome or it can be a group of chromosomes as described elsewhere herein.
[00190] In another embodiment, the sequence of interest is a segment of a chromosome associated with a partial aneuploidy, for example, a chromosomal deletion or insertion or unbalanced chromosomal translocation and the normalizing sequence is a chromosome segment (or group of segments ) that is not associated with partial aneuploidy and whose variation in sequence label coverage is close to that of the chromosome segment associated with partial aneuploidy. The selected normalizing chromosome segment (s) may be the one or more that best matches the variation in sequence label coverage of the sequence of interest. Any one or more segments of any one or more chromosomes 1-22, X and Y can be a sequence of interest.
[00191] In other embodiments, the sequence of interest is a segment of a chromosome associated with a partial aneuploidy and the normalizing sequence is an entire chromosome or chromosomes. In still other embodiments, the sequence of interest is an entire chromosome associated with an aneuploidy and the normalizing sequence is a chromosome segment or segments that are not associated with the aneuploidy.
[00192] Whether a single sequence or a group of sequences is identified in the qualified samples as the normalizing sequence(s) for any one or more sequences of interest, the qualified normalizing sequence can be chosen to have a variation in sequence label coverage or in a fragment size parameter that best or effectively approximates that of the sequence of interest as determined in the qualified samples. For example, a qualified normalizing sequence is a sequence that produces the least variability across the qualified samples when used to normalize the sequence of interest, that is, the variability of the normalizing sequence is closest to that of the sequence of interest determined in the qualified samples. In other words, the qualified normalizing sequence is the sequence selected to produce the least variation in the sequence dose (for the sequence of interest) across the qualified samples. Thus, the process selects a sequence that, when used as a normalizing chromosome, is expected to produce the least run-to-run variability in the chromosome dose for the sequence of interest.
[00193] The normalizing sequence identified in the qualified samples for any one or more sequences of interest remains the normalizing sequence of choice to determine the presence or absence of aneuploidy in the test samples for days, weeks, months and possibly years, provided that the procedures needed to generate sequencing libraries and sequence the samples are essentially unchanged over time. As described above, normalizing sequences for determining the presence of aneuploidies are chosen (possibly among other reasons as well) because the variability in the number of sequence labels, or in the fragment size parameter values, mapped to them among samples, for example, different samples, and among sequencing runs, for example, sequencing runs that occur on the same day and / or on different days, best approximates the variability of the sequence of interest for which they are used as a normalizing parameter. Substantial changes in these procedures will affect the number of labels that are mapped to all sequences, which in turn will determine which one or which group of sequences has a variability across samples, in the same and / or in different sequencing runs, on the same day or on different days, that most closely approximates that of the sequence(s) of interest. Such changes would require the set of normalizing sequences to be determined again. Substantial changes in procedures include changes to the laboratory protocol used to prepare the sequencing library, which includes changes related to sample preparation for multiplex sequencing rather than singleplex sequencing, and changes to the sequencing platforms, which include changes in the chemistry used for sequencing.
[00194] In some embodiments, the normalizing sequence chosen to normalize a sequence of particular interest is a sequence that best distinguishes one or more qualified samples from one or more affected samples, which implies that the normalizing sequence is a sequence that has the highest differentiability, that is, the differentiability of the normalizing sequence is such that it provides optimal differentiation for a sequence of interest in an affected test sample to easily distinguish the affected test sample from other unaffected samples. In other modalities, the normalizing sequence is a sequence that has a combination of the least variability and the most differentiability.
[00195] The level of differentiability can be determined as a statistical difference between sequence doses, for example, chromosome doses or segmental doses, in a population of qualified samples and the chromosome dose(s) in one or more test samples as described below and shown in the Examples. For example, differentiability can be represented numerically as a t-test value, which represents the statistical difference between chromosome doses in a population of qualified samples and the chromosome dose(s) in one or more test samples. Similarly, differentiability can be based on segmental doses rather than chromosome doses. Alternatively, differentiability can be represented numerically as a Normalized Chromosome Value (NCV), which is a z-score for chromosome doses provided that the distribution for NCV is normal. Similarly, in the case where chromosome segments are the sequences of interest, segmental dose differentiability can be represented numerically as a Normalized Segment Value (NSV), which is a z-score for chromosome segment doses provided that the distribution for NSV is normal. In determining the z-score, the mean and standard deviation of chromosome or segmental doses in a set of qualified samples can be used. Alternatively, the mean and standard deviation of chromosome or segmental doses in a training set comprising qualified samples and affected samples can be used. In other embodiments, the normalizing sequence is a sequence that has the least variability and the greatest differentiability or an optimal combination of small variability and large differentiability.
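As an illustration of the z-score described above, the following is a minimal sketch, not the claimed implementation. The function name `normalized_chromosome_value` and the example dose values are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """Return the z-score of a test dose against doses from qualified samples.

    Assumption: the sample (n-1) standard deviation is used; the source text
    only states that the mean and standard deviation of the qualified doses
    are used.
    """
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)
    if sigma == 0:
        raise ValueError("qualified doses have zero spread; z-score undefined")
    return (test_dose - mu) / sigma


# Hypothetical chromosome doses measured in five qualified samples.
qualified = [0.10, 0.12, 0.11, 0.09, 0.13]
ncv = normalized_chromosome_value(0.15, qualified)
```

The same function applies unchanged to segmental doses, in which case the result corresponds to an NSV.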
[00196] The method identifies sequences that inherently have similar characteristics and that are prone to similar variations between samples and sequencing runs and that are useful for determining doses of sequence in the test samples.
Determination of sequence doses
[00197] In some embodiments, chromosome or segmental doses for one or more chromosomes or segments of interest are determined in all qualified samples as described in step 146 shown in Figure 1, and a normalizing chromosome or segment sequence is identified in step 145. Some normalizing sequences are provided before the sequence doses are calculated. Then one or more normalizing sequences are identified according to various criteria as described below, see step 145. In some embodiments, for example, the identified normalizing sequence results in the least variability in the sequence dose for the sequence of interest across all qualified samples.
[00198] In step 146, based on the calculated qualified label densities, a qualified sequence dose, that is, a chromosome dose or a segmental dose, for a sequence of interest is determined as the ratio of the sequence label coverage for the sequence of interest and qualified sequence label coverage for additional sequences from which normalizing sequences are subsequently identified in step 145. The identified normalizing sequences are subsequently used to determine the sequence doses in the test samples.
[00199] In one embodiment, the sequence dose in qualified samples is a chromosome dose that is calculated as the ratio of the number of sequence labels or fragment size parameter for a chromosome of interest and the number of sequence labels for a normalizing chromosome sequence in a qualified sample. The normalizing chromosome sequence can be a single chromosome, a group of chromosomes, a segment of a chromosome or a group of segments from different chromosomes. Consequently, a chromosome dose for a chromosome of interest is determined in a qualified sample as the ratio of the number of labels for the chromosome of interest and the number of labels for (i) a normalizing chromosome sequence composed of a single chromosome, (ii) a normalizing chromosome sequence composed of two or more chromosomes, (iii) a normalizing segment sequence composed of a single segment of a chromosome, (iv) a normalizing segment sequence composed of two or more segments from one chromosome or (v) a normalizing segment sequence composed of two or more segments of two or more chromosomes. Examples for determining a chromosome dose for chromosome 21 of interest according to (i) - (v) are as follows: chromosome doses for the chromosome of interest, for example, chromosome 21, are determined as a ratio of the chromosome 21 sequence label coverage and one of the following sequence label coverages: (i) each of all remaining chromosomes, i.e., chromosomes 1-20, chromosome 22, X chromosome and Y chromosome; (ii) all possible combinations of two or more remaining chromosomes; (iii) a segment of another chromosome, for example, chromosome 9; (iv) two segments of another chromosome, for example, two segments of chromosome 9; (v) two segments of two different chromosomes, for example, a segment of chromosome 9 and a segment of chromosome 14.
[00200] In another embodiment, the sequence dose in qualified samples is a segmental dose as opposed to a chromosome dose, which segmental dose is calculated as the ratio of the number of sequence labels for a segment of interest, which is not an entire chromosome, and the number of sequence labels for a normalizing segment sequence in a qualified sample. The normalizing segment sequence can be any of the normalizing chromosome or segment sequences discussed above.
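The dose ratio described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sequence_dose`, the dictionary layout and the label counts are hypothetical, and summing the coverage over a group of normalizing sequences is an assumed way of combining them that the text does not spell out.

```python
def sequence_dose(coverage, interest, normalizers):
    """Ratio of the coverage of the sequence of interest to that of its normalizer.

    coverage    -- dict mapping a sequence name (chromosome or segment) to its
                   sequence label count in one sample
    interest    -- name of the sequence of interest
    normalizers -- one or more names forming the normalizing sequence
    Assumption: a group of normalizing sequences contributes the sum of its
    members' coverage to the denominator.
    """
    denominator = sum(coverage[name] for name in normalizers)
    if denominator <= 0:
        raise ValueError("normalizing sequence has no coverage")
    return coverage[interest] / denominator


# Hypothetical label counts for one qualified sample.
sample = {"chr21": 200, "chr14": 500, "chr9": 300}
dose_single = sequence_dose(sample, "chr21", ["chr14"])         # case (i)
dose_group = sequence_dose(sample, "chr21", ["chr14", "chr9"])  # case (ii)
```

Segments can be handled identically by keying `coverage` on segment identifiers instead of chromosome names.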
Identification of normalizing sequences
[00201] In step 145, a normalizing sequence is identified for a sequence of interest. In some embodiments, for example, the normalizing sequence is the sequence that, based on the calculated sequence doses, results in the least variability in the sequence dose for the sequence of interest across all qualified training samples. The method identifies sequences that inherently have similar characteristics, that are prone to similar variations between samples and sequencing runs, and that are useful for determining the sequence doses in the test samples.
[00202] The normalizing sequences for one or more sequences of interest can be identified in a set of qualified samples, and the sequences that are identified in the qualified samples are subsequently used to calculate sequence doses for one or more sequences of interest in each of the test samples (step 150) to determine the presence or absence of aneuploidy in each of the test samples. The normalizing sequence identified for the chromosomes or segments of interest may differ when different sequencing platforms are used and / or when differences exist in the purification of the nucleic acid that is to be sequenced and / or in the preparation of the sequencing library. The use of normalizing sequences according to the methods described herein provides specific and sensitive measures of a variation in the copy number of a chromosome or segment thereof regardless of the sample preparation and / or sequencing platform that is used.
[00203] In some embodiments, more than one normalizing sequence is identified, that is, different normalizing sequences can be determined for a sequence of interest and multiple sequence doses can be determined for a sequence of interest. For example, the variation, for example, the coefficient of variation (CV = standard deviation / mean), in the chromosome dose for the chromosome of interest 21 is minimal when chromosome 14 sequence label coverage is used. However, two, three, four, five, six, seven, eight or more normalizing sequences can be identified for use in determining a sequence dose for a sequence of interest in a test sample. As an example, a second dose for chromosome 21 in any test sample can be determined using chromosome 7, chromosome 9, chromosome 11 or chromosome 12 as the normalizing chromosome sequence as these chromosomes all have CV close to that for the chromosome 14.
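The selection by coefficient of variation described above can be sketched as follows. This is an illustrative sketch only: the function names and the label counts are hypothetical, only single-chromosome candidates are ranked, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean, as defined in the text."""
    return stdev(values) / mean(values)


def rank_normalizers(samples, interest):
    """Rank candidate normalizing chromosomes by the CV of the resulting dose.

    samples  -- list of dicts, one per qualified sample, mapping chromosome
                name to sequence label count
    interest -- chromosome of interest
    Returns (candidate, cv) pairs, lowest CV first.
    """
    candidates = [name for name in samples[0] if name != interest]
    cv_by_candidate = {}
    for candidate in candidates:
        doses = [s[interest] / s[candidate] for s in samples]
        cv_by_candidate[candidate] = coefficient_of_variation(doses)
    return sorted(cv_by_candidate.items(), key=lambda pair: pair[1])


# Hypothetical counts in three qualified samples: chr14 tracks chr21, chr1 does not.
qualified = [
    {"chr21": 100, "chr14": 200, "chr1": 900},
    {"chr21": 110, "chr14": 220, "chr1": 800},
    {"chr21": 90, "chr14": 180, "chr1": 1000},
]
ranking = rank_normalizers(qualified, "chr21")
best_normalizer = ranking[0][0]
```

Keeping the full ranking rather than only the minimum makes it possible to pick several normalizing sequences with similar CV, as the paragraph above contemplates.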
[00204] In some embodiments, when a single chromosome is chosen as the normalizing chromosome sequence for a chromosome of interest, the normalizing chromosome sequence will be a chromosome that results in the chromosome doses for the chromosome of interest that have the least variability across of all tested samples, for example, qualified samples. In some cases, the best normalizing chromosome may not have the minimum variation, but it may have a qualified dose distribution that best distinguishes a test sample or samples from qualified samples, that is, the best normalizing chromosome may not have the lowest variation , but it can have the greatest differentiability.
[00205] In some embodiments, normalizing sequences include one or more robust autosomal sequences or segments thereof. In some embodiments, robust autosomes include all autosomes except for the chromosome(s) of interest. In some embodiments, robust autosomes include all autosomes except for chromosomes X, Y, 13, 18 and 21. In some embodiments, robust autosomes include all autosomes except those determined from a sample to deviate from a normal diploid state, which can be useful in analyzing cancerous genomes that have an abnormal copy number in relation to a normal diploid genome.
Determination of aneuploidies in test samples
[00206] Based on the identification of the normalizing sequence(s) in the qualified samples, a sequence dose is determined for a sequence of interest in a test sample comprising a mixture of nucleic acids derived from genomes that differ in one or more sequences of interest.
[00207] In step 115, a test sample is obtained from a subject suspected or known to carry a clinically relevant CNV of a sequence of interest. The test sample can be a biological fluid, for example, plasma or any suitable sample as described below. As explained, the sample can be obtained using a non-invasive procedure such as a simple blood draw. In some embodiments, a test sample contains a mixture of nucleic acid molecules, for example, cfDNA molecules. In some embodiments, the test sample is a sample of maternal plasma that contains a mixture of fetal and maternal cfDNA molecules.
[00208] In step 125, at least a portion of the test nucleic acids in the test sample is sequenced as described for the qualified samples to generate millions of sequence readings, for example, 36 base pair readings. In various embodiments, 2x36 base pair paired-end readings are used for paired-end sequencing. As in step 120, the readings generated from sequencing the nucleic acids in the test sample are uniquely mapped or aligned with a reference genome to produce labels. As described in step 120, at least about 3 x 10^6 qualified sequence labels, at least about 5 x 10^6 qualified sequence labels, at least about 8 x 10^6 qualified sequence labels, at least about 10 x 10^6 qualified sequence labels, at least about 15 x 10^6 qualified sequence labels, at least about 20 x 10^6 qualified sequence labels, at least about 30 x 10^6 qualified sequence labels, at least about 40 x 10^6 qualified sequence labels or at least about 50 x 10^6 qualified sequence labels comprising readings between 20 and 40 base pairs are obtained from readings that map uniquely to a reference genome. In certain embodiments, the readings produced by the sequencing apparatus are provided in an electronic format. Alignment is performed using computational devices as discussed below. The individual readings are compared against the reference genome, which is often vast (millions of base pairs), to identify sites where the readings uniquely correspond to the reference genome. In some embodiments, the alignment procedure allows limited mismatch between readings and the reference genome. In some cases, 1, 2 or 3 base pairs in a reading are allowed to mismatch the corresponding base pairs in a reference genome and a mapping is still made.
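The idea of a unique mapping that tolerates a limited number of mismatches can be illustrated with a toy example. This sketch is not how production aligners work (they rely on indexed data structures rather than a linear scan); the function names, the short reference string and the default mismatch limit are all hypothetical.

```python
def count_mismatches(read, window):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(read, window))


def unique_site(read, reference, max_mismatches=2):
    """Return the single reference position where the read maps, else None.

    A position qualifies when the read differs from the reference window by at
    most max_mismatches bases. A read that qualifies at zero or at several
    positions is not uniquely mapped and yields None.
    """
    hits = [
        pos
        for pos in range(len(reference) - len(read) + 1)
        if count_mismatches(read, reference[pos:pos + len(read)]) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


reference = "ACGTACGTTTGACCA"  # toy reference; real genomes are vastly longer
site_exact = unique_site("TTGACC", reference, max_mismatches=0)
site_one_mismatch = unique_site("TTGTCC", reference, max_mismatches=1)
site_repeat = unique_site("ACGT", reference, max_mismatches=0)  # occurs twice
```

A read that returns a position would become a label once the site information is attached to it; a read that returns `None` would not be counted.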
[00209] In step 135, all or most of the labels obtained from the sequencing of nucleic acids in the test samples are counted to determine a test sequence label coverage using one of the computational devices as described below. In some embodiments, each reading is aligned with a particular region of the reference genome (a chromosome or segment in most cases) and the reading is converted to a label by attaching site information to the reading. As this process unfolds, computational devices can maintain a running count of the number of labels / readings mapping to each region of the reference genome (chromosome or segment in most cases). The counts are stored for each chromosome or segment of interest and each corresponding normalizing chromosome or segment.
[00210] In certain embodiments, the reference genome has one or more excluded regions that are part of a true biological genome, but are not included in the reference genome. Readings that potentially align with excluded regions are not counted. Examples of excluded regions include regions of long repeat sequences, regions of similarity between X and Y chromosomes, etc. Using a masked reference sequence obtained by the masking techniques described above, only labels on the unmasked segments of the reference sequence are taken into account for the CNV analysis.
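A running per-region count that skips masked segments can be sketched as follows. The data layout is an assumption made for illustration: each label is a hypothetical `(chromosome, position)` pair, and masked segments are given as half-open `(start, end)` intervals per chromosome.

```python
from collections import Counter


def count_labels(labels, masked):
    """Count labels per chromosome, ignoring those that fall in masked segments.

    labels -- iterable of (chromosome, position) pairs, one per label
    masked -- dict mapping chromosome name to a list of half-open
              (start, end) intervals that are excluded from counting
    """
    counts = Counter()
    for chromosome, position in labels:
        intervals = masked.get(chromosome, [])
        if any(start <= position < end for start, end in intervals):
            continue  # label lies on a masked segment; not counted
        counts[chromosome] += 1
    return counts


# Hypothetical labels and a single masked segment on chr21.
labels = [("chr21", 100), ("chr21", 5000), ("chr14", 42), ("chr21", 150)]
masked = {"chr21": [(4000, 6000)]}
coverage = count_labels(labels, masked)
```

The resulting counts per chromosome (or per segment, if labels are keyed that way) are the coverage values from which sequence doses are later formed.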
[00211] In some modalities, the method determines whether to count a label more than once when multiple readings align to the same site in a reference genome or sequence. There may be occasions when two labels have the same sequence and, therefore, align to an identical site in a reference sequence. The method used to count labels may under certain circumstances exclude identical labels from the same sequenced sample from counting. If a disproportionate number of labels is identical in a given sample, it suggests that there is a strong trend or other defect in the procedure. Therefore, according to certain modalities, the counting method does not count labels from a given sample that are identical to the sample labels that were previously counted.
[00212] Several criteria can be established to choose when disregarding an identical label from a single sample. In certain modalities, a defined percentage of the labels that are counted must be unique. If more labels than this threshold are not unique, they are disregarded. For example, if the defined percentage requires that at least 50% be unique, identical labels are not counted until the percentage of unique labels exceeds 50% for the sample. In other modalities, the threshold number of unique labels is at least about 60%. In other embodiments, the threshold percentage for single labels is at least about 75% or at least about 90% or at least about 95% or at least about 98% or at least about 99%. A threshold can be adjusted to 90% for chromosome 21. If 30M of labels are aligned to chromosome 21, then at least 27M of them must be unique. If 3M of counted labels are not unique and the 30 million and the first label are not unique, the same is not counted. The choice of the particular threshold or other criteria used to determine when not to count other identical labels can be selected using appropriate statistical analysis. One factor influencing this threshold or other criterion is the relative amount of sample sequenced for the size of the genome to which the labels can be aligned. Other factors include the size of the readings and similar considerations.
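One possible reading of the uniqueness criterion above is sketched below. It is an interpretation, not the claimed procedure: a label at a site already seen is counted only if the fraction of unique labels among the counted labels would stay at or above the defined percentage. The function name and the toy site identifiers are hypothetical.

```python
def count_with_unique_floor(label_sites, min_unique_fraction):
    """Count labels while keeping the unique fraction at or above a floor.

    label_sites         -- iterable of hashable site identifiers, one per label
    min_unique_fraction -- e.g. 0.9 to require that at least 90% of the
                           counted labels be unique
    Returns (counted, unique). A label at a previously seen site is skipped
    whenever counting it would push unique / counted below the floor.
    """
    seen = set()
    counted = 0
    unique = 0
    for site in label_sites:
        if site not in seen:
            seen.add(site)
            counted += 1
            unique += 1
        elif unique / (counted + 1) >= min_unique_fraction:
            counted += 1
    return counted, unique


# Toy input: site "a" appears four times among six labels.
sites = ["a", "b", "a", "a", "c", "a"]
lenient = count_with_unique_floor(sites, 0.5)
strict = count_with_unique_floor(sites, 0.9)
```

Under this reading, with 27M unique labels among 30M counted and a 90% floor, a further non-unique label would be skipped, which agrees with the chromosome 21 example in the text.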
[00213] In one embodiment, the number of test sequence labels mapped to a sequence of interest is normalized to the known length of a sequence of interest to which they are mapped to provide a test sequence label density ratio . As described for qualified samples, normalization to the known length of a sequence of interest is not required and can be included as a step to reduce the number of digits in a number to simplify it for human interpretation. Since all mapped test sequence labels are counted in the test sample, the sequence label coverage for a sequence of interest, for example, a clinically relevant sequence, in the test samples is determined, as are the sequence label for additional sequences that correspond to at least one normalizing sequence identified in the qualified samples.
[00214] In step 150, based on the identity of at least one normalizing sequence in the qualified samples, a dose of test sequence is determined for a sequence of interest in the test sample. In various embodiments, the test sequence dose is computationally determined using the sequence label covers of the sequence of interest and the corresponding normalizing sequence as described herein. The computational devices responsible for this enterprise will electronically access the association between the sequence of interest and its associated normalizing sequence, which can be stored in a database, table, graph or be included as code in program instructions.
[00215] As described elsewhere, the at least one normalizing sequence can be a single sequence or a group of sequences. The sequence dose for a sequence of interest in a test sample is a ratio of the sequence label coverage determined to the sequence of interest in the test sample and the sequence label coverage of at least one normalizing sequence determined in the sample. test, where the normalizing sequence in the test sample corresponds to the normalizing sequence identified in the samples qualified for the sequence of particular interest. For example, if the normalizing sequence identified for chromosome 21 in qualified samples is determined to be a chromosome, for example, chromosome 14, then the test sequence dose for chromosome 21 (sequence of interest) is determined as the coverage ratio sequence label for chromosome 21 in and in the sequence label cover for chromosome 14 each determined in the test sample. Similarly, chromosome doses for chromosomes 13, 18, X, Y and other chromosomes associated with chromosomal aneuploidies are determined. A normalizing sequence for a chromosome of interest can be one or a group of chromosomes or one or a group of chromosome segments. As previously described, a sequence of interest can be part of a chromosome, for example, a chromosome segment. Consequently, the dose for a chromosome segment can be determined as the ratio of the sequence label coverage determined for the segment in the test sample and the sequence label coverage for the normalizing chromosome segment in the test sample, where the normalizing segment in the test sample corresponds to the normalizing segment (single or a group of segments) identified in the samples qualified for the segment of particular interest. Chromosome segments can vary from kilobases (kb) to megabases (Mb) in size (for example, about 1 kb to 10 kb or about 10 kb to 100 kb or about 100 kb to 1 Mb).
[00216] In step 155, the threshold values are derived from standard deviation values established for the qualified sequence doses determined in a plurality of qualified samples and the sequence dose determined for known samples to be aneuploid for a sequence of interest. Note that this operation is typically performed asynchronously with analysis of patient test samples. It can be performed, for example, concurrently with the selection of normalized sequences for qualified samples. The precise classification depends on the differences between the probability distributions for the different classes, that is, type of aneuploidy. In some examples, the thresholds are chosen from the empirical distribution for each type of aneuploidy, for example, trisomy 21. The possible threshold values that have been established to classify the aneuploidies of trisomy 13, trisomy 18, trisomy 21 and monosomy X as described in Examples, which describe the use of the method to determine chromosomal aneuploidies by sequencing cfDNA extracted from a maternal sample comprising a mixture of fetal and maternal nucleic acids. The threshold value that is determined to distinguish samples affected for an aneuploidy from a chromosome can be the same or it can be different from the threshold for a different aneuploidy. As shown in the Examples, the threshold value for each chromosome of interest is determined from the variability in the dose of the chromosome of interest across the samples and sequencing runs. The less variable the chromosome dose for any chromosome of interest, the narrower the spread in the dose for the chromosome of interest across all unaffected samples, which are used to adjust the threshold to determine different aneuploidies.
[00217] Returning to the process flow associated with the classification of a patient test sample, in step 160, the copy number variation of the sequence of interest is determined in the test sample by comparing the test sequence dose for the sequence of interest with at least one threshold value established from the qualified sequence doses. This operation can be performed by the same computational devices used to measure sequence label coverages and/or calculate segment doses.
[00218] In step 160, the dose calculated for a test sequence of interest is compared with the established threshold values, which are chosen according to a user-defined "threshold of reliability", to classify the sample as "normal", "affected" or "no qualification" (that is, no call). "No qualification" samples are samples for which a definitive diagnosis cannot be reliably made. Each type of affected sample (for example, trisomy 21, partial trisomy 21, monosomy X) has its own thresholds, one for qualifying normal (unaffected) samples and another for qualifying affected samples (although in some cases the two thresholds coincide). As described elsewhere, under some circumstances a "no qualification" can be converted to a qualification (affected or normal) if the fetal fraction of nucleic acid in the test sample is high enough. The classification of the test sequence can be reported by the computational devices used in other operations of this process flow. In some cases, the classification is reported in an electronic format and can be displayed, sent by email, sent as a text message, etc. to interested persons.
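The three-way classification in step 160 can be sketched as below. This is an illustrative sketch under stated assumptions: the function name and the threshold values are hypothetical, and the comparison direction assumes a gain (for example, a trisomy) in which a higher score indicates an affected sample; a loss such as monosomy X would reverse the comparisons.

```python
def classify(score, normal_max, affected_min):
    """Classify a test score against two thresholds.

    normal_max   -- scores at or below this value are qualified as normal
    affected_min -- scores at or above this value are qualified as affected
    Scores strictly between the two receive no qualification (no call).
    """
    if normal_max > affected_min:
        raise ValueError("normal_max must not exceed affected_min")
    if score >= affected_min:
        return "affected"
    if score <= normal_max:
        return "normal"
    return "no qualification"


# Hypothetical thresholds for one type of aneuploidy.
NORMAL_MAX, AFFECTED_MIN = 2.5, 4.0
results = [classify(s, NORMAL_MAX, AFFECTED_MIN) for s in (1.0, 3.0, 5.0)]
```

When the two thresholds coincide, the "no qualification" zone disappears and every sample is qualified as normal or affected.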
[00219] In some embodiments, the determination of CNV comprises calculating an NCV or NSV, which relates the chromosome or segment dose to the mean of the corresponding chromosome or segment dose in a set of qualified samples, as described above. The CNV can then be determined by comparing the NCV/NSV with a predetermined copy number threshold.
[00220] The threshold for evaluating copy number can be chosen to optimize the rates of false positives and false negatives. The higher the threshold, the less likely a false positive will occur. Similarly, the lower the threshold, the less likely a false negative will occur. Thus, a trade-off exists between a first ideal threshold above which only true positives are classified and a second ideal threshold below which only true negatives are classified.
[00221] Thresholds are set largely according to the variability in chromosome doses for a particular chromosome of interest, as determined in a set of unaffected samples. The variability depends on several factors, including the fraction of fetal cfDNA present in a sample. Variability (CV) is determined from the mean or median and the standard deviation of chromosome doses across a population of unaffected samples. Thus, the threshold(s) for classifying aneuploidy use NCVs, according to:

$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

(where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualified samples and $x_{ij}$ is the observed j-th chromosome dose for test sample i), with an associated fetal fraction such as:

$$FF_{ij} = 2 \times \left| \frac{x_{ij} - \hat{\mu}_j}{\hat{\mu}_j} \right| = 2 \times \left| NCV_{ij} \right| \times CV_j, \qquad CV_j = \frac{\hat{\sigma}_j}{\hat{\mu}_j}$$

[00222] Thus, for each NCV of a chromosome of interest, an expected fetal fraction associated with the given NCV value can be calculated from the CV, based on the mean and standard deviation of the chromosome ratio for the chromosome of interest across a population of unaffected samples.
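The relationship between NCV, CV and fetal fraction can be illustrated numerically. The sketch below is an assumption-laden example rather than the specification's code: it takes the NCV as the standardized distance of a test dose from the qualified-sample mean, and it assumes that one extra or missing fetal copy shifts the dose by half the fetal fraction, so that the fetal fraction is approximately 2 × |NCV| × CV. The function names and dose values are hypothetical.

```python
from statistics import mean, stdev


def ncv(dose, qualified_doses):
    """Normalized chromosome value: (dose - mean) / standard deviation."""
    return (dose - mean(qualified_doses)) / stdev(qualified_doses)


def fetal_fraction_from_ncv(ncv_value, qualified_doses):
    """Fetal fraction implied by an NCV, assuming FF = 2 * |NCV| * CV."""
    cv = stdev(qualified_doses) / mean(qualified_doses)
    return 2 * abs(ncv_value) * cv


# Hypothetical chromosome doses in qualified (unaffected) samples.
qualified = [0.98, 1.00, 1.02]
test_ncv = ncv(1.05, qualified)                      # about 2.5
implied_ff = fetal_fraction_from_ncv(test_ncv, qualified)  # about 0.10
```

Read in the other direction, a given fetal fraction and CV predict the NCV expected for an affected sample, which is what makes the fetal fraction useful when placing a decision threshold.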
[00223] Subsequently, based on the relationship between the fetal fraction and the NCV values, a decision threshold can be chosen above which samples are determined to be positive (affected), based on quantiles of the normal distribution. As described above, in some embodiments, a threshold is set for the optimal trade-off between the detection of true positives and the rate of false negative results. Namely, the threshold is chosen to maximize the sum of true positives and true negatives or to minimize the sum of false positives and false negatives.

[00224] Certain embodiments provide a method for prenatal diagnosis of a fetal chromosomal aneuploidy in a biological sample comprising fetal and maternal nucleic acid molecules. The diagnosis is made by obtaining sequence information from at least a portion of the mixture of fetal and maternal nucleic acid molecules derived from a biological test sample, for example, a maternal plasma sample; computing from the sequencing data a normalized chromosome dose for one or more chromosomes of interest and/or a normalized segment dose for one or more segments of interest; determining a statistically significant difference between the chromosome dose for the chromosome of interest and/or the segment dose for the segment of interest, respectively, in the test sample and a threshold value established in a plurality of qualified (normal) samples; and providing the prenatal diagnosis based on the statistical difference. As described in step 160 of the method, a diagnosis of normal or affected is made. A "no qualification" is provided when a diagnosis of normal or affected cannot be made with confidence.
[00225] In some embodiments, two thresholds can be chosen. A first threshold is chosen to minimize the false positive rate, above which samples will be classified as "affected", and a second threshold is chosen to minimize the false negative rate, below which samples will be classified as "unaffected". Samples having NCVs above the second threshold but below the first threshold can be classified as "suspected aneuploidy" or "unqualified" samples, for which the presence or absence of aneuploidy can be confirmed by independent means. The region between the first and second thresholds can be referred to as an "unqualified" region.

[00226] In some embodiments, the suspected and unqualified thresholds are shown in Table 1. As can be seen, NCV thresholds vary across different chromosomes. In some embodiments, the thresholds vary according to the fetal fraction (FF) of the sample, as explained above. The threshold techniques applied here contribute to improved sensitivity and selectivity in some embodiments.

TABLE 1. Suspected and Affected NCV Thresholds Bracketing the Unqualified Ranges

Analysis of Fragment Size and Sequence Coverage

[00227] As mentioned above, fragment size parameters, as well as coverage, can be used to assess CNV. The size of a cell-free nucleic acid fragment, for example, a cfDNA fragment, can be obtained by paired end sequencing, electrophoresis (e.g., microchip-based capillary electrophoresis) and other methods known in the art. Figure 2A schematically illustrates how paired end sequencing can be used to determine both fragment size and sequence coverage.
[00228] The top half of Figure 2A shows a diagram of a cell-free fetal DNA fragment and a cell-free maternal DNA fragment serving as templates for a paired end sequencing process. Conventionally, long nucleic acid sequences are fragmented into shorter sequences to be read in a paired end sequencing process. Such fragments are also referred to as inserts. Fragmentation is not necessary for cell-free DNA because it already exists as fragments that are mostly shorter than 300 base pairs. Cell-free fetal DNA fragments in maternal plasma have been shown to be shorter than maternal DNA fragments. As shown at the top of Figure 2A, cell-free DNA of fetal origin has an average length of about 167 base pairs, while cell-free DNA of maternal origin has an average length of about 175 base pairs. In paired end sequencing on certain platforms, such as Illumina's sequencing-by-synthesis platform as further described below, adapter sequences, index sequences and/or primer sequences are ligated to the two ends of a fragment (not shown in Figure 2A). A fragment is first read in one direction, providing reading 1 from one end of the fragment. Then a second reading starts from the opposite end of the fragment, providing reading 2 of the sequence. The correspondence between reading 1 and reading 2 can be identified by their coordinates in the flow cell. Reading 1 and reading 2 are then mapped to a reference sequence as a pair of labels that lie close to each other, as shown in the lower half of Figure 2A. In some embodiments, if the readings are long enough, the two readings may overlap in the middle portion of the insert. After the pair is aligned to the reference sequence, the relative distance between the two readings and the fragment length can be determined from the positions of the two readings. 
Because paired end readings provide twice as many base pairs as single end readings of the same reading length, they help improve alignment quality, especially for sequences with many repeats or non-unique sequences. In many embodiments, a reference sequence is divided into bins, such as 100 kb bins. After the paired end readings are aligned to the reference sequence, the number of readings aligned to a bin can be determined. The number and the lengths of the inserts (for example, cfDNA fragments) can also be determined for a bin. In some embodiments, if an insert straddles two bins, half of the insert can be assigned to each bin.
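A minimal sketch of deriving an insert length from a mapped read pair and assigning the insert to bins follows. The coordinate convention (0-based leftmost aligned position of each read on the reference), the function names and the read positions are assumptions for illustration; the sketch also assumes that an insert is much shorter than a bin, so it can touch at most two bins.

```python
BIN_SIZE = 100_000  # 100 kb bins, as in the example in the text


def fragment_span(read1_start, read1_len, read2_start, read2_len):
    """Return (start, end) of the insert on the reference, end exclusive."""
    start = min(read1_start, read2_start)
    end = max(read1_start + read1_len, read2_start + read2_len)
    return start, end


def bin_weights(start, end, bin_size=BIN_SIZE):
    """Assign the insert to one bin, or half to each of two straddled bins."""
    first = start // bin_size
    last = (end - 1) // bin_size
    if first == last:
        return {first: 1.0}
    return {first: 0.5, last: 0.5}


# Hypothetical pair of 36-base reads spanning a bin boundary.
start, end = fragment_span(99_950, 36, 100_081, 36)
length = end - start                 # 167 base pairs
weights = bin_weights(start, end)    # half to bin 0, half to bin 1
```

Aligners that emit SAM records already report a template length for properly paired reads, so in practice the span often does not need to be recomputed from the two positions.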
[00229] Figure 2B shows an embodiment providing process 220 for using size-based coverage to determine a copy number variation of a nucleic acid sequence of interest in a test sample including cell-free nucleic acid fragments that originate from two or more genomes. As described here, a parameter is "biased toward a fragment size or size range" when: 1) the parameter is favorably weighted for the fragment size or size range, for example, a count weighted more heavily when associated with fragments of the size or size range than for other sizes or ranges; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size or size range, for example, a ratio obtained from a count weighted more heavily when associated with fragments of the size or size range. A fragment size or size range can be characteristic of a genome or a portion thereof when the genome produces nucleic acid fragments enriched in, or having a higher concentration of, the size or size range relative to nucleic acid fragments of another genome or another portion of the same genome.
[00230] Process 220 begins by receiving the sequence readings obtained by sequencing the cell-free nucleic acid fragments in the test sample. See block 222. The two or more genomes in the test sample can be the genome of a pregnant mother and the genome of a fetus carried by the pregnant mother. In other applications, the test sample includes cell-free DNA from tumor cells and unaffected cells. In some embodiments, because of the high signal-to-noise ratio provided by the size-biased coverage, sequencing of cell-free nucleic acid fragments is performed without the need to amplify the nucleic acid fragments using PCR. Process 220 further involves aligning the sequence readings of the cell-free nucleic acid fragments to a reference genome that includes the sequence of interest and is divided into a plurality of bins. Successful alignment results in test sequence labels, which include the sequence and its location in the reference sequence. See block 224. Process 220 then proceeds by determining the sizes of the cell-free nucleic acid fragments in the test sample. Some embodiments applying paired end sequencing provide the length of the insert associated with a sequence label. See block 226. The terms "size" and "length" are used interchangeably when used with reference to nucleic acid sequences or fragments. In the embodiment illustrated here, process 220 further involves weighting the test sequence labels based on the sizes of the cell-free nucleic acid fragments from which the labels are obtained. See block 228. As used herein, "weighting" refers to modifying a quantity using one or more variables or functions. The one or more variables or functions are considered a "weight". In many embodiments, the quantity is multiplied by the weight. In other embodiments, the quantity can be modified exponentially or in another way. 
In some embodiments, the weighting of the test sequence labels is carried out by biasing the coverage toward test sequence labels obtained from cell-free nucleic acid fragments of a size or size range characteristic of one genome in the test sample. As described herein, a size is characteristic of a genome when the genome has nucleic acid enriched in, or at a higher concentration of, that size relative to another genome or another portion of the same genome.
[00231] In some embodiments, the weighting function can be a linear or non-linear function. Examples of applicable non-linear functions include, but are not limited to, the Heaviside step function, box-car functions, staircase functions and sigmoidal functions. In some embodiments, a Heaviside function or a box-car function is used, such that a label in a specific size range is multiplied by a weight of 1 and labels outside the range are multiplied by a weight of 0. In some embodiments, fragments between 80 and 150 base pairs are given a weight of 1, while fragments outside this range are given a weight of 0. In these examples, the weighting is discrete, being zero or one depending on whether the parameter value falls within or outside a particular range. Alternatively, weights are calculated as a continuous function of fragment size or another aspect of the associated parameter value.
[00232] In some embodiments, the weights for fragments in one size range are positive and those in another range are negative. This can help enhance the signal when the differences between the two genomes have opposite signs in the two ranges. For example, reading counts have a weight of 1 for inserts of 80 to 150 base pairs and a weight of -1 for inserts of 160 to 200 base pairs.
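The weighting schemes above can be written as small functions. The sketch below is illustrative only: the function names are hypothetical, the inclusive treatment of the range endpoints is an assumption, and the midpoint and scale of the sigmoid are arbitrary values chosen to show a continuous alternative rather than parameters from the specification.

```python
import math


def boxcar_weight(size, low=80, high=150):
    """Weight 1 inside the size range (inclusive, by assumption), else 0."""
    return 1.0 if low <= size <= high else 0.0


def signed_weight(size):
    """Weight +1 for 80-150 bp, -1 for 160-200 bp, and 0 elsewhere."""
    if 80 <= size <= 150:
        return 1.0
    if 160 <= size <= 200:
        return -1.0
    return 0.0


def sigmoid_weight(size, midpoint=150.0, scale=10.0):
    """Continuous weight: near 1 for short fragments, near 0 for long ones.
    midpoint and scale are hypothetical tuning values."""
    return 1.0 / (1.0 + math.exp((size - midpoint) / scale))


example_sizes = [100, 155, 167, 175]
boxcar = [boxcar_weight(s) for s in example_sizes]
signed = [signed_weight(s) for s in example_sizes]
```

A discrete weight discards fragments outside the range entirely, while a continuous weight keeps some contribution from fragments near the range boundary.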
[00233] Weights can be applied to counts as well as to other parameters. For example, weighting can also be applied to fractional or ratio parameters that use fragment size. For example, the ratio can give fragments in certain size sub-ranges greater weight than fragments in other size ranges.
[00234] The coverages are then calculated for the bins based on the weighted test sequence labels. See block 230. Such coverages are considered size-biased. As explained above, a value is biased toward a fragment size or size range when the parameter is favorably weighted for the fragment size or size range. Process 220 also involves identifying a copy number variation in the sequence of interest from the calculated coverages. See block 232. In some embodiments, as explained further below in connection with Figures 2C, 3A-3K and 4, the coverages can be adjusted or corrected to remove noise in the data, thereby increasing the signal-to-noise ratio. In some applications, coverage based on the weighted labels obtained in process 220 provides higher sensitivity and/or higher selectivity than unweighted coverage in determining copy number variation. In some applications, the example workflow provided below can further improve the sensitivity and selectivity of CNV analysis.
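Computing size-weighted coverage per bin can be sketched as follows. The data layout (each label as a position and an insert size), the function name and the sample values are assumptions for illustration; any weighting function of fragment size can be passed in.

```python
from collections import defaultdict


def size_weighted_coverage(labels, weight, bin_size=100_000):
    """Sum the weights of the labels falling in each bin.

    labels -- iterable of (position, fragment_size) pairs
    weight -- function mapping a fragment size to a weight
    """
    coverage = defaultdict(float)
    for position, size in labels:
        coverage[position // bin_size] += weight(size)
    return dict(coverage)


def short_only(size):
    """Box-car weight keeping fragments of 80 to 150 base pairs."""
    return 1.0 if 80 <= size <= 150 else 0.0


# Hypothetical labels: (position on the reference, insert size in bp).
labels = [(10_500, 120), (20_000, 170), (150_000, 140),
          (160_000, 90), (170_000, 185)]

weighted = size_weighted_coverage(labels, short_only)
unweighted = size_weighted_coverage(labels, lambda size: 1.0)
```

Here the unweighted coverage counts every label, while the weighted coverage keeps only the contribution of short fragments in each bin.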
Example Workflow for Analyzing Fragment Size and/or Sequence Coverage

[00235] Some of the described embodiments provide methods for determining sequence coverage quantities with low noise and/or high signal, providing data for determining various genetic conditions related to copy number and CNV with improved sensitivity, selectivity and/or efficiency relative to sequence coverage quantities obtained by conventional methods. In certain embodiments, the sequences of a test sample are processed to obtain sequence coverage quantities.
[00236] The process makes use of certain information available from other sources. In some implementations, all of this information is obtained from a training set of samples known to be unaffected (for example, non-aneuploid). In other embodiments, some or all of the information is obtained from other test samples, which can be provided "on the fly" when multiple samples are analyzed in the same process.
[00237] In certain embodiments, sequence masks are used to reduce data noise. In some embodiments, both the sequence of interest and its normalizing sequences are masked. In some embodiments, different masks can be used when different chromosomes or segments of interest are considered. For example, one mask (or group of masks) can be used when chromosome 13 is the chromosome of interest and a different mask (or group of masks) can be used when chromosome 21 is the chromosome of interest. In certain embodiments, masks are defined at the resolution of the bins. Therefore, in one example, the mask resolution is 100 kb. In some embodiments, a separate mask can be applied to the Y chromosome. The masked exclusion regions for the Y chromosome can be provided at a finer resolution (1 kb) than for other chromosomes of interest, as described in US Provisional Patent Application No. 61/836,057, filed June 17, 2013 [attorney docket no. ARTEPOO8P]. The masks are provided in the form of files identifying the excluded genomic regions.
[00238] In certain embodiments, the process uses an expectation value of normalized coverage to remove bin-to-bin variation in the profile of a sequence of interest, variation that is uninformative for determining CNV in the test sample. The process adjusts normalized coverage quantities according to the expectation value of normalized coverage for each bin across the entire genome, or at least for the bins of the robust chromosomes in the reference genome (for use in operation 317 below). Parameters other than coverage can also be improved by this process. The expectation value can be determined from a training set of unaffected samples. As an example, the expectation value can be the average value across the training set samples. The normalized coverage of a sample can be determined as the number of unique, non-redundant labels aligned to a bin divided by the total number of unique, non-redundant labels aligned to all bins on the robust chromosomes of the reference genome.
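The expectation value described above can be sketched as a per-bin average of normalized coverage over a training set. The function names and the counts are hypothetical, and the sketch assumes that every sample is represented as a list of label counts over the same ordered set of bins.

```python
from statistics import mean


def normalize(bin_counts):
    """Divide each bin count by the total count over all bins."""
    total = sum(bin_counts)
    return [count / total for count in bin_counts]


def expected_profile(training_counts):
    """Per-bin average of normalized coverage across training samples."""
    normalized = [normalize(sample) for sample in training_counts]
    return [mean(column) for column in zip(*normalized)]


# Hypothetical label counts for three bins in two unaffected samples.
training = [
    [10, 20, 70],
    [20, 20, 60],
]
profile = expected_profile(training)   # about [0.15, 0.20, 0.65]
```

A median could be substituted for the mean when the training set contains outlying samples; that substitution is a design choice, not something this sketch takes from the specification.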
[00239] Figure 2C shows a flowchart of a process 200 for determining a fragment size parameter for a sequence of interest, which parameter is used to evaluate the copy number of the sequence of interest in a test sample in block 214. This process removes systematic variation common across unaffected training samples, a variation that increases the noise in the analysis for CNV evaluation. It also removes GC bias specific to a test sample, thereby increasing the signal-to-noise ratio in the data analysis. Notably, process 200 can also be applied to coverage, regardless of whether the coverage is biased by size or not. Similarly, the processes in Figures 2D, 3 and 4 are equally applicable to coverage, coverage weighted by fragment size, fragment size, fraction or ratio of fragments in a defined size range, fragment methylation level, etc.
[00240] Process 200 begins by providing sequence readings from the test sample, as indicated in block 202. In some embodiments, the sequence readings are obtained by sequencing DNA segments obtained from the blood of a pregnant woman, including cfDNA of the mother and of the fetus. The process then aligns the sequence readings to a reference genome including the sequence of interest, providing test sequence labels. See block 204. In some embodiments, readings that align to more than one site are excluded. In some embodiments, multiple readings that align to the same site are excluded or reduced to a single reading count. In some embodiments, readings aligned to excluded sites are also excluded. Therefore, in some embodiments, only uniquely aligned, non-redundant labels aligned to non-excluded sites are counted, providing a non-excluded site count (NES count) used to determine coverage or other parameters for each bin.
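The three filters above (multi-site alignments, duplicates and excluded sites) can be sketched as one pass over the alignments. The data layout is a simplifying assumption: each reading carries the list of sites it aligned to, and excluded sites are given as individual (chromosome, position) pairs rather than as the region files a real mask would use. The function name is hypothetical.

```python
def non_excluded_sites(alignments, excluded):
    """Return the set of sites counted toward the NES count.

    alignments -- list of (reading_id, [(chromosome, position), ...])
    excluded   -- set of (chromosome, position) sites to mask
    """
    sites = set()
    for _reading_id, hits in alignments:
        if len(hits) != 1:        # unaligned or aligned to several sites
            continue
        site = hits[0]
        if site in excluded:      # falls on a masked site
            continue
        sites.add(site)           # duplicates collapse to a single count
    return sites


alignments = [
    ("r1", [("chr21", 100)]),
    ("r2", [("chr21", 100)]),                  # duplicate of r1
    ("r3", [("chr21", 200), ("chr13", 50)]),   # aligned to two sites
    ("r4", [("chr21", 300)]),                  # masked site
    ("r5", [("chr21", 400)]),
]
nes = non_excluded_sites(alignments, excluded={("chr21", 300)})
nes_count = len(nes)
```

This version reduces duplicates to a single count; the alternative mentioned in the text, dropping duplicated sites entirely, would need a count per site before filtering.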
[00241] Process 200 provides the sizes of the cell-free nucleic acid fragments in the test sample. In some embodiments using paired end sequencing, an insert size/length can be obtained from the locations of a pair of readings at the ends of the insert. Other techniques can be used to determine fragment size. See block 205. Then, in the bins of the reference genome, including bins in the sequence of interest, process 200 determines values of a fragment size parameter biased toward fragment sizes characteristic of one of the genomes. The term "fragment size parameter" refers to a parameter relating to the size or length of a nucleic acid fragment or a collection of nucleic acid fragments, for example, cfDNA fragments obtained from a body fluid. As used herein, a parameter is "biased toward a fragment size or size range" when: 1) the parameter is favorably weighted for the fragment size or size range, for example, a count weighted more heavily when associated with fragments of the size or size range than for other sizes or ranges; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size or size range, for example, a ratio obtained from a count weighted more heavily when associated with fragments of the size or size range. A fragment size or size range can be characteristic of a genome or a portion thereof when the genome produces nucleic acid fragments enriched in, or having a higher concentration of, the size or size range relative to nucleic acid fragments of another genome or another portion of the same genome.
[00242] In some embodiments, the fragment size parameter is a size-weighted count. In some embodiments, a fragment is weighted 1 within a range and 0 outside the range. In other embodiments, the fragment size parameter is a fraction or ratio of fragments in a size range. See block 206. In some embodiments, the value of the fragment size parameter (or coverage, as mentioned above) for each bin is divided by the value of the parameter for the normalizing sequence in the same sample, providing a normalized parameter.
[00243] Process 200 then provides a global profile of the sequence of interest. The global profile comprises an expected parameter value in each bin, obtained from a training set of unaffected samples. See block 208. Process 200 removes the variation common to the training samples by adjusting the normalized parameter values of the test sequence labels according to the expected parameter values, obtaining global-profile-corrected parameter values for the sequence of interest. See block 210. In some embodiments, the expected value of the parameter obtained from the training set provided in block 208 is an average across the training samples. In some embodiments, operation 210 adjusts the normalized value of the parameter by subtracting the expected value of the parameter from the normalized value of the parameter. In other embodiments, operation 210 divides the normalized value of the parameter by the expected value of the parameter for each bin to produce the global-profile-corrected value of the parameter.
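The two adjustment variants described for operation 210 (subtraction and division) can be sketched per bin as follows. The function names and the numeric values are hypothetical; the expected profile is assumed to have been computed beforehand from unaffected training samples over the same bins.

```python
def correct_by_subtraction(normalized, expected):
    """Subtract the expected value from the normalized value in each bin."""
    return [n - e for n, e in zip(normalized, expected)]


def correct_by_division(normalized, expected):
    """Divide the normalized value by the expected value in each bin."""
    return [n / e for n, e in zip(normalized, expected)]


# Hypothetical normalized parameter values for a test sample and the
# expected values for the same three bins.
test_values = [0.30, 0.20, 0.50]
expected = [0.15, 0.20, 0.50]

residuals = correct_by_subtraction(test_values, expected)  # 0 means "as expected"
ratios = correct_by_division(test_values, expected)        # 1 means "as expected"
```

With division, a bin that matches the training expectation maps to 1; with subtraction, it maps to 0. Downstream steps need to know which convention is in use.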
[00244] In addition to or instead of the global profile correction, process 200 removes GC bias specific to the test sample by adjusting the parameter value. As shown in block 212, the process adjusts the global-profile-corrected parameter value based on the relationship between the GC content level and the global-profile-corrected coverage in the test sample, thus obtaining a GC-corrected value of the fragment size parameter for the sample. After adjusting for the systematic variation common to unaffected training samples and for within-subject GC bias, the process provides the fragment size value corrected for the global profile and/or GC variation, which is used to assess the CNV of the sample with improved sensitivity and specificity. In some implementations, the fragment size value can be adjusted using a principal component analysis method to remove components of variation unrelated to the copy number variation of the sequence of interest, as further described with reference to block 719 of Figure 2F. In some implementations, the fragment size value can be curated by removing outlier bins within a sample, as described with reference to block 321 of Figure 3A.
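One simple way to remove a sample-specific GC trend is sketched below. It is a swapped-in illustrative technique, not the correction defined for block 212: bins are grouped by rounded GC fraction, and each bin value is divided by the median value of its GC group. The function name, the grouping step and the values are all assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_correct(values, gc_fractions, step=0.01):
    """Divide each bin value by the median value of bins with similar GC.

    values       -- per-bin parameter values (already profile-corrected)
    gc_fractions -- per-bin GC fraction of the reference sequence
    step         -- width of a GC group (hypothetical choice)
    """
    groups = defaultdict(list)
    for value, gc in zip(values, gc_fractions):
        groups[round(gc / step)].append(value)
    medians = {key: median(group) for key, group in groups.items()}
    return [value / medians[round(gc / step)]
            for value, gc in zip(values, gc_fractions)]


# Hypothetical sample in which high-GC bins read about twice as high.
values = [0.9, 1.1, 1.8, 2.2]
gc = [0.40, 0.40, 0.55, 0.55]
corrected = gc_correct(values, gc)   # about [0.9, 1.1, 0.9, 1.1]
```

The sketch assumes every GC group has a non-zero median; sparse groups would need pooling or a fitted smooth curve in a practical setting.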
Multiple Pass Process for Determining the Number of Copies Using Multiple Parameters

[00245] As emphasized above, the processes described here are suitable for determining CNV using multiple parameters, including, but not limited to, coverage, coverage weighted by fragment size, fragment size, fraction or ratio of fragments within a defined size range, fragment methylation level, etc. Each of these parameters can be processed separately to contribute individually to the final determination of copy number variation.
[00246] In some embodiments, similar processes can be applied to a size-weighted coverage analysis and a fragment size analysis, both of which use fragment size parameters. Figure 2D shows a flowchart of two overlapping passes of workflow 600, pass 1 for size-weighted coverage and pass 2 for fragment size analysis. In another embodiment not shown here, the methylation level can be processed in an additional pass. The two passes may include comparable operations to obtain adjusted coverage information, on which the CNV determination is based.
[00247] An initial single-pass portion of the process begins by receiving the sequencing data (see block 602) and continues through the computation of counts as described above (see block 612). After this point, the depicted process divides into two passes, as described above. Returning to the initial portion of the process, the workflow converts the sequencing data into sequence readings. When the sequencing data is derived from multiplex sequencing, the sequence readings are also demultiplexed to identify the source of the data. See block 604. The sequence readings are then aligned to a reference sequence, and the aligned sequence readings are provided as sequence labels. See block 606. The sequence labels are then filtered to obtain non-excluded sites (NESs), which are unambiguously mapped, non-duplicated sequence labels. Sequence labels are organized into bins of specific sequence length, such as 1 kb, 100 kb or 1 Mb. See block 610. In some embodiments involving analysis of syndrome-specific regions, the bins are 100 kb. In some embodiments, bins exhibiting high variability can be masked using a sequence mask obtained from a plurality of unaffected samples in the manner described for Figure 3A, block 313. The labels at the NESs are then counted to provide the coverages to be normalized and adjusted for the CNV analysis. See block 612.
[00248] In the depicted embodiment, operations 604, 606, 610 and 612 are performed once and most of the remaining operations are performed twice, once for the size-weighted coverage analysis (pass 1) and once for the fragment size analysis (pass 2). In other embodiments, one or more of the operations shown as performed in two passes are performed only once and the results are shared by both passes. Examples of such shared operations include operations 614, 616 and 618.
[00249] In the depicted embodiments, the obtained coverage (size-weighted counts) or fragment size parameter (size fractions or ratios) of the NESs is normalized, for example, by dividing the NES value of a bin by the total NESs of the genome or of a set of normalizing chromosomes. In some embodiments, only the coverage is normalized, while the fragment size parameter does not need to be normalized, because it is not affected by sequencing depth in the same way as coverage. See block 614. Then, in some embodiments, the variation common to a training set of unaffected samples, which is unrelated to the CNV of interest, is removed. In the depicted embodiment, the common variation is represented as a global wave profile obtained from unaffected samples, similar to the global wave profile described above. In some embodiments, as illustrated in Figure 6, the unaffected samples used to obtain a global wave profile include samples that originate from the same flow cell or processing batch. See block 616. The calculation of the flow-cell-specific global wave is further explained below. In the depicted embodiment, after the global wave profile has been removed, the coverages are corrected for GC level on a sample-specific basis. See block 618. Some algorithms for GC correction are described in more detail below in the text associated with Figure 3A, block 319.
[00250] In the depicted embodiment, in both pass 1 for the weighted coverage analysis and pass 2 for the fragment size analysis, the data can be further filtered for noise specific to an individual sample. For example, data from outlier bins whose coverage differs extremely from that of other bins, a difference that cannot be attributed to the copy number variation of interest, can be removed from the analysis. See block 622. This within-sample filtering operation can correspond to block 321 in Figure 3A.
[00251] In some embodiments, after single-sample filtering, the weighted coverage values of pass 1 and the fragment size parameter of pass 2 are both enriched for the target signal relative to the reference. See blocks 624 and 628. Then, the coverage and the fragment size parameter for the chromosome are each used to calculate a chromosome dose and a normalized chromosome value (NCV) as described above. The NCV can then be compared to a criterion to determine a score indicating the probability of a CNV. See blocks 626 and 630. The scores from the two passes can then be combined to provide a final composite score, which determines whether an aneuploidy should be called. In some embodiments, the scores of blocks 626 and 630 are t-test statistics or Z values. In some embodiments, the final score is a chi-square value. In other embodiments, the final score is the root mean square of the two t values or z scores. Other means of combining the scores from the two passes can be used to improve overall sensitivity and selectivity in the detection of CNV. Alternatively, the scores from the two passes can be combined by logical operations, for example, an AND operation or an OR operation. For example, when high sensitivity is preferred to ensure few false negatives, a CNV call can be made when the score of pass 1 OR pass 2 meets a call criterion. On the other hand, if high selectivity is desired to ensure few false positives, a CNV call can be made only if the scores of both pass 1 AND pass 2 meet a call criterion.
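The ways of combining the scores from the two passes can be sketched as below. The function names and the criterion value are hypothetical. Treating the sum of two squared z scores as a chi-square value assumes the two scores are independent and standard normal in unaffected samples, which is an assumption of this sketch rather than a statement from the text.

```python
import math


def root_mean_square(score1, score2):
    """Root mean square of the two pass scores."""
    return math.sqrt((score1 ** 2 + score2 ** 2) / 2)


def chi_square(score1, score2):
    """Sum of squared z scores (chi-square with 2 degrees of freedom
    under the independence assumption stated above)."""
    return score1 ** 2 + score2 ** 2


def call_or(score1, score2, criterion):
    """Favor sensitivity: call if either pass meets the criterion."""
    return score1 >= criterion or score2 >= criterion


def call_and(score1, score2, criterion):
    """Favor selectivity: call only if both passes meet the criterion."""
    return score1 >= criterion and score2 >= criterion


# Hypothetical scores from pass 1 and pass 2, and a hypothetical criterion.
pass1, pass2, criterion = 3.0, 5.0, 4.0
combined = root_mean_square(pass1, pass2)
```

With these values the OR rule makes a call and the AND rule does not, which is the sensitivity and selectivity trade-off between the two logical combinations.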
[00252] Notably, the logic operations above involve a trade-off between sensitivity and selectivity. In some embodiments, a two-step sequencing method is applied to overcome the trade-off, as further described below. In summary, the initial score of a sample is compared against a relatively low first threshold, designed to increase sensitivity, and if the sample scores higher than the first threshold, it undergoes a second round of sequencing, which is deeper than the first. Such a sample is then reprocessed and analyzed in a workflow similar to the one described above. The resulting score is then compared to a relatively high second threshold, designed to improve selectivity. In some embodiments, the samples that undergo the second round of sequencing are those scoring relatively low among the samples that score above the first threshold, thereby reducing the number of samples that need to be sequenced again.
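The two-step decision can be sketched as a small function. Everything named here is hypothetical: the thresholds are arbitrary, and `resequence` stands in for the deeper second round of sequencing and reanalysis, returning the new score.

```python
def two_step_call(first_score, resequence, low_threshold, high_threshold):
    """Return (classification, was_resequenced).

    A sample at or below the low threshold is reported unaffected without
    further work. Otherwise it is resequenced more deeply and the new score
    is compared against the high threshold.
    """
    if first_score <= low_threshold:
        return "unaffected", False
    second_score = resequence()
    if second_score > high_threshold:
        return "affected", True
    return "unaffected", True


LOW, HIGH = 2.0, 4.0  # hypothetical first and second thresholds

# The callables below simulate the score obtained from deeper sequencing.
clear_negative = two_step_call(1.0, lambda: 9.0, LOW, HIGH)
confirmed = two_step_call(3.0, lambda: 5.0, LOW, HIGH)
not_confirmed = two_step_call(3.0, lambda: 3.5, LOW, HIGH)
```

Because the second round is only triggered above the low threshold, most unaffected samples are sequenced once, and the cost of deeper sequencing is spent only on candidates.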
[00253] In some modalities, a 3rd pass using a 3rd parameter can be used. An example of this 3rd pass is methylation. Methylation can be determined directly by measuring methylation of the sample's nucleic acids or indirectly as a parameter that correlates with the cell-free nucleic acid fragment size.
[00254] In some embodiments, this third parameter is a second coverage- or count-based parameter, where the counts are based on fragment sizes outside the primary fragment size range used in the first count-based parameter. When fragments between 80 and 150 base pairs are used to generate the count or coverage parameter, about 70% of the readings from a sequencing run are excluded. To the degree that these excluded readings still carry some potentially useful signal, they can be used in a third parameter that includes the excluded readings, or readings in a size-based fraction that lies outside or overlaps the size-based fraction used in the first parameter. In this regard, the readings and associated coverage values taken from the excluded fragments may be given a lower weight. In other words, the copy number variation parameter calculated using these readings can be given less importance in making a final copy number variation call. Alternatively, as described above, tags outside the size range of the first parameter can take a negative value when the two genomes have opposite characteristics in the two size ranges.
[00255] In various implementations, the coverages in processes 200, 220 and 600 are biased toward tags from fragments at the shorter end of a fragment size spectrum. In some embodiments, the coverages are biased toward tags from fragments of sizes shorter than a specified value. In some embodiments, the coverages are biased toward tags from fragments in a range of fragment sizes, and the upper end of the range is about 150 base pairs or less.
[00256] In various implementations of processes 200, 220 and 600, sequence readings are obtained by sequencing the cell-free nucleic acid fragments without first using PCR to amplify the nucleic acids of the cell-free nucleic acid fragments. In various embodiments, sequencing readings are obtained by sequencing the cell-free nucleic acid fragments to a depth no greater than about 6 M fragments per sample. In some embodiments, the sequencing depth is no greater than about 1 M fragments per sample. In some embodiments, sequencing readings are obtained by multiplex sequencing and the number of multiplexed samples is at least about 24.
[00257] In various implementations of processes 200, 220 and 600, the test sample comprises plasma from an individual. In some embodiments, the processes further comprise obtaining the cell-free nucleic acid from the test sample. In some embodiments, the processes further comprise sequencing the cell-free nucleic acid fragments that originate from two or more genomes.
[00258] In several implementations of processes 200, 220 and 600, the two or more genomes comprise genomes of a mother and a fetus. In some implementations, the variation in the number of copies in the sequence of interest comprises aneuploidy in the genome of the fetus.
[00259] In some implementations of processes 200, 220 and 600, the two or more genomes comprise genomes of cancer cells and somatic cells. In some implementations, the processes comprise using a variation in the number of copies in the cancerous genome to diagnose cancer, monitor cancer progress and/or determine a treatment for cancer. In some implementations, the variation in the number of copies causes a genetic abnormality.
[00260] In some implementations of processes 200, 220 and 600, the coverages are biased toward tags from fragments at the longer end of a fragment size spectrum. In some implementations, the coverages are biased toward tags from fragments that are longer than a specified value. In some implementations, the coverages are biased toward tags from fragments in a range of fragment sizes, where the lower end of the range is about 150 base pairs or more.
[00261] In some implementations of processes 200, 220 and 600, the processes also involve: determining, in bins of the reference genome including the sequence of interest, methylation levels of the cell-free nucleic acid fragments in said bins, and using the methylation levels, in addition to or instead of the calculated coverages or fragment size parameter values, to identify a variation in the number of copies. In some implementations, using the methylation levels to identify a variation in the number of copies involves providing a global methylation profile for the bins of the sequence of interest. The global methylation profile includes expected methylation levels in at least the bins of the sequence of interest. In some implementations, the expected methylation levels are obtained from the lengths of cell-free nucleic acid fragments in a training set of unaffected training samples comprising nucleic acids sequenced and aligned in substantially the same manner as the nucleic acid fragments of the test sample, the expected methylation levels exhibiting variation from bin to bin. In some implementations, the processes involve adjusting the values of the methylation levels using the expected methylation levels in the bins of at least the sequence of interest, thereby obtaining global-profile-corrected values of the methylation levels for the sequence of interest, the processes also involving identifying a variation in the number of copies using the global-profile-corrected coverages and the global-profile-corrected methylation levels.
In some implementations, identifying a variation in the number of copies using the global-profile-corrected coverages and the global-profile-corrected methylation levels also includes: adjusting the global-profile-corrected coverages and the global-profile-corrected methylation levels based on GC content levels, thereby obtaining GC-corrected coverages and GC-corrected values of the methylation levels for the sequence of interest; and identifying a variation in the number of copies using the GC-corrected coverages and the GC-corrected methylation levels.
[00262] In some implementations of processes 200, 220 and 600, the fragment size parameter comprises a fraction or ratio including a portion of the cell-free nucleic acid fragments in the test sample having fragment sizes shorter or longer than a threshold value. In some implementations, the fragment size parameter includes a fraction including (i) a number of fragments in the test sample within a first size range including 110 base pairs and (ii) a number of fragments in the test sample within a second size range comprising the first size range and sizes outside the first size range.
Determining the number of copies using a three-pass process, probability ratios, t-statistics and/or fetal fractions
[00263] Figure 2E shows a flow chart of a three-pass process for assessing the number of copies. Process 700 includes three overlapping workflow passes: pass 1 (or 713A) analyzes the coverage of readings associated with fragments of all sizes, pass 2 (or 713B) analyzes the coverage of readings associated with shorter fragments, and pass 3 (or 713C) analyzes the relative frequency of shorter readings with respect to all readings.
[00264] Process 700 is similar to process 600 in its overall organization. The operations indicated by blocks 702, 704, 706, 710 and 712 can be performed in the same way as, or in a similar way to, the operations indicated by blocks 602, 604, 606, 610 and 612. After the reading counts are obtained, coverage is determined using readings of fragments of all sizes in pass 713A. Coverage is determined using readings of the short fragments in pass 713B. The frequency of readings of the short fragments relative to all readings is determined in pass 713C. The relative frequency is also referred to as a size ratio or a size fraction elsewhere herein. It is an example of a fragment size characteristic. In some implementations, short fragments are fragments shorter than about 150 base pairs. In various implementations, the short fragments can be in the size ranges of about 50 to 150, 80 to 150 or 110 to 150 base pairs. In some implementations, the third pass, or pass 713C, is optional.
[00265] The data from the three passes 713A, 713B and 713C all undergo normalization operations 714, 716, 718, 719 and 722 to remove variation unrelated to the number of copies of the sequence of interest. These normalization operations are grouped in block 723. Operation 714 involves normalizing the analyzed quantity of the sequence of interest by dividing the analyzed quantity by the total value of the quantity for the reference sequence. This normalization step uses values obtained from the test sample. Similarly, operations 718 and 722 normalize the analyzed quantity using values obtained from the test sample. Operations 716 and 719 use values obtained from a training set of unaffected samples.
[00266] Operation 716 removes the variation of a global wave profile obtained from the training set of unaffected samples, using methods that are the same as or similar to those described with reference to block 616. Operation 718 removes sample-specific GC variation using methods that are the same as or similar to those described with reference to block 618.
[00267] Operation 719 removes further variation using a principal component analysis (PCA) method. The variation removed by the PCA method is due to factors unrelated to the number of copies of the sequence of interest. The quantity analyzed in each bin (coverage, fragment size ratio, etc.) provides an independent variable for the PCA, and the samples of the unaffected training set provide values for these independent variables. The training set samples all have the same number of copies of the sequence of interest, for example, two copies of a somatic chromosome, one copy of the X chromosome (when male samples are used as unaffected samples) or two copies of the X chromosome (when female samples are used as unaffected samples). Thus, the variation among the samples does not result from an aneuploidy or other difference in the number of copies. The PCA of the training set produces principal components that are unrelated to the number of copies of the sequence of interest. The principal components can then be used to remove variation in a test sample that is unrelated to the number of copies of the sequence of interest.
[00268] In some embodiments, the variation of one or more of the principal components is removed from the data of the test sample using coefficients estimated from the data of the unaffected samples in a region outside the sequence of interest. In some implementations, the region comprises all robust chromosomes. For example, a PCA is performed on the normalized bin coverage data of normal training samples, thereby providing principal components corresponding to the dimensions in which most of the variation in the data can be captured. The variation thus captured is unrelated to copy number variation in the sequence of interest. After the principal components have been obtained from the normal training samples, they are applied to the test data. A linear regression model, with the test sample as the response variable and the principal components as explanatory (independent) variables, is fitted across the bins of a region outside the sequence of interest. The resulting regression coefficients are used to normalize the bin coverage of the region of interest by subtracting the linear combination of principal components defined by the estimated regression coefficients. This removes variation unrelated to CNV from the sequence of interest. See block 719. The residual data are used for downstream analysis. In addition, operation 722 removes outlier data points using the methods described with reference to block 622.
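The PCA-based normalization can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the implementation of block 719: the function names are hypothetical, the principal components are taken from a singular value decomposition of the mean-centered training matrix, and the regression is fitted without an intercept term.

```python
import numpy as np


def pca_components(train, k):
    """Top-k principal components of unaffected training samples.

    train : array of shape (n_samples, n_bins) with normalized bin coverage.
    Returns an array of shape (k, n_bins); each row is one component.
    """
    centered = train - train.mean(axis=0)
    # Rows of vt are the principal directions in bin space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def remove_components(test, components, outside_mask):
    """Subtract the fitted linear combination of components from a test sample.

    test         : array of shape (n_bins,), the test sample's bin values.
    outside_mask : boolean array, True for bins outside the sequence of
                   interest; only these bins are used to fit the regression.
    """
    design = components[:, outside_mask].T          # (n_outside_bins, k)
    coef, *_ = np.linalg.lstsq(design, test[outside_mask], rcond=None)
    # Residuals are computed for every bin, including the sequence of interest.
    return test - components.T @ coef
```

Fitting the coefficients only on bins outside the sequence of interest keeps a genuine copy number signal in the sequence of interest from being absorbed into the regression.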
[00269] After undergoing the normalization operations in block 723, the coverage values of all bins have been "normalized" to remove sources of variation other than aneuploidy or other copy number variations. In a sense, the bins of the sequence of interest are enriched or altered relative to other bins for purposes of detecting copy number variation. See block 724, which is not an operation but represents the resulting coverage values. The normalization operations in large block 723 can increase the signal and/or reduce the noise of the quantity under analysis. Similarly, the short-fragment coverage values for the bins have been normalized to remove sources of variation other than aneuploidy or other copy number variations, as shown in block 728, and the relative frequency of short fragments (or size ratio) for the bins has been similarly normalized, as shown in block 732. Like block 724, blocks 728 and 732 are not operations but represent the coverage and relative frequency values after the processing of large block 723. It should be understood that the operations in large block 723 can be modified, rearranged or removed. For example, in some embodiments, PCA operation 719 is not performed. In other embodiments, GC correction operation 718 is not performed. In other embodiments, the order of operations is changed; for example, PCA operation 719 is performed before GC correction operation 718.
[00270] The coverage of all fragments after normalization and removal of the variation shown in block 724 is used to obtain a t statistic in block 726. Similarly, the coverage of short fragments after normalization and removal of the variation shown in block 728 is used to obtain a t statistic in block 730 and the relative short fragment frequency after normalization and removal of the variation shown in block 732 is used to obtain a t statistic in block 734.
[00271] Figure 2F demonstrates why applying a t-statistic to the copy number analysis can help improve the accuracy of the analysis. Figure 2F shows, in each panel, the frequency distributions of the normalized bin coverage of a sequence of interest and of a reference sequence, with the distribution of the sequence of interest overlaying and partly obscuring the distribution of the reference sequence. The top panel shows the bin coverage for a sample with higher coverage, having more than 6 million readings; the bottom panel shows the bin coverage for a sample with lower coverage, having fewer than 2 million readings. The horizontal axis indicates the coverage normalized with respect to the mean coverage of the reference sequence. The vertical axis indicates the relative probability density, related to the number of bins having the corresponding coverage values. Figure 2F is thus a type of histogram. The distribution for the sequence of interest is shown in the foreground and the distribution for the reference sequence is shown in the background. The mean of the distribution for the sequence of interest is lower than that of the reference sequence, indicating a decreased number of copies in the sample. The difference in means between the sequence of interest and the reference sequence is similar for the high-coverage sample in the top panel and the low-coverage sample in the bottom panel. Thus, the difference in means can, in some implementations, be used to identify a copy number variation in the sequence of interest. Note that the distributions of the high-coverage sample have smaller variances than those of the low-coverage sample. Using only the means to distinguish the two distributions does not capture the difference between them as well as using both the mean and the variance. A t-statistic reflects both the mean and the variance of the distributions.
[00272] In some implementations, operation 726 calculates a t-statistic as follows:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

where $x_1$ is the bin coverage of the sequence of interest (with mean $\bar{x}_1$), $x_2$ is the bin coverage of the reference region/sequence (with mean $\bar{x}_2$), $s_1$ is the standard deviation of the coverage of the sequence of interest, $s_2$ is the standard deviation of the coverage of the reference region, $n_1$ is the number of bins in the sequence of interest, and $n_2$ is the number of bins in the reference region.
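As an illustration, a two-sample t-statistic with unequal variances (Welch's form) can be computed from the quantities defined above as in the following sketch. The function name and the use of sample (n - 1) variances are assumptions of this example.

```python
import math
from statistics import mean, variance


def t_statistic(interest_bins, reference_bins):
    """Two-sample t-statistic with unequal variances (Welch's form).

    interest_bins  : normalized coverage of the bins in the sequence of interest.
    reference_bins : normalized coverage of the bins in the reference region.
    Sample variances (n - 1 denominator) are used, an assumption of this sketch.
    """
    n1, n2 = len(interest_bins), len(reference_bins)
    standard_error = math.sqrt(
        variance(interest_bins) / n1 + variance(reference_bins) / n2
    )
    return (mean(interest_bins) - mean(reference_bins)) / standard_error
```

A negative value indicates lower mean coverage in the sequence of interest than in the reference region, and the magnitude grows as the bin-to-bin variance shrinks, which is why deeper sequencing separates the two distributions more clearly.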
[00273] In some implementations, the reference region includes all robust chromosomes (for example, chromosomes other than those most likely to harbor an aneuploidy). In some implementations, the reference region includes at least one chromosome outside the sequence of interest. In some implementations, the reference region includes robust chromosomes not including the sequence of interest. In other implementations, the reference region includes a set of chromosomes (for example, a subset of chromosomes selected from the robust chromosomes) that has been determined to provide the best signal detection capability for a set of training samples. In some embodiments, the signal detection capability is based on the ability of the reference region to discriminate bins harboring copy number variations from bins that do not harbor copy number variations. In some embodiments, the reference region is identified in a manner similar to that used to determine a "normalizing sequence" or "normalizing chromosome" as described in the section entitled "Identification of Normalization Sequences".
[00274] Returning to Figure 2E, one or more fetal fraction estimates (block 735) can be combined with any of the t-statistics in blocks 726, 730 and 734 to obtain a probability estimate for a ploidy case. See block 736. In some implementations, the one or more fetal fractions in block 740 are obtained through any of process 800 in Figure 2G, process 900 in Figure 2H or process 1000 in Figure 2I. The processes can be implemented in parallel using a workflow such as workflow 1100 in Figure 2J.
[00275] Figure 2G shows an exemplary process 800 for determining the fetal fraction from coverage information according to some implementations of the description. Process 800 begins by obtaining coverage information (for example, sequence dose values) for the training samples of a training set. See block 802. Each sample of the training set is obtained from a pregnant woman known to be carrying a male fetus. That is, the sample contains cfDNA of the male fetus. In some implementations, operation 802 may obtain normalized sequence coverage in forms other than the sequence dose described herein, or may obtain other coverage values.
[00276] Process 800 then involves calculating the fetal fractions of the training samples. In some implementations, the fetal fraction can be calculated from the sequence dose values:

$$ FF_i = -2 \times \frac{R_{x_i} - \mathrm{median}(R_{x_j})}{\mathrm{median}(R_{x_j})} $$

where $R_{x_i}$ is the sequence dose for a male sample and $\mathrm{median}(R_{x_j})$ is the median of the sequence doses for the female samples. In other implementations, the mean or other measures of central tendency can be used. In some implementations, FF can be obtained by other methods, such as from the relative frequency of the X and Y chromosomes. See block 804.
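A minimal sketch of a dose-based fetal fraction calculation follows. It assumes the relation FF = -2 × (R_male − median(R_female)) / median(R_female), which follows from a male fetus contributing one X chromosome copy where the mother contributes two; the function name and the example dose values are hypothetical.

```python
from statistics import median


def fetal_fraction_from_x_dose(male_dose, female_doses):
    """Estimate fetal fraction from the X-chromosome sequence dose.

    male_dose    : X-chromosome dose of a sample carrying a male fetus.
    female_doses : X-chromosome doses of samples carrying female fetuses,
                   used as the two-copy baseline.
    Assumes the dose drops by FF/2 (relative) when the fetus has one X copy.
    """
    baseline = median(female_doses)
    return -2.0 * (male_dose - baseline) / baseline
```

For example, a male-fetus sample whose X dose is 5% below the female baseline corresponds to an estimated fetal fraction of about 10%.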
[00277] Process 800 also involves dividing the reference sequence into multiple bins of sub-sequences. In some implementations, the reference sequence is a complete genome. In some implementations, the bins are 100 kb bins. In some implementations, the genome is divided into about 25,000 bins. The process then obtains the coverages for the bins. See block 806. In some implementations, the coverages used in block 806 are obtained after undergoing the normalization operations shown in block 1123 of Figure 2J. In other implementations, coverages from different size ranges can be used.
[00278] Each bin is associated with the coverages of the samples in the training set. Therefore, for each bin, a correlation can be obtained between the coverages of the samples and the fetal fractions of the samples. Process 800 involves obtaining the correlations between fetal fraction and coverage for all bins. See block 808. The process then selects the bins having correlation values above a threshold. See block 810. In some implementations, the 6000 bins having the highest correlation values are selected. The purpose is to identify bins that show a high correlation between coverage and fetal fraction in the training samples. These bins can then be used to predict the fetal fraction in the test sample. Although the training samples are male samples, the correlation between fetal fraction and coverage can be generalized to male and female test samples.
[00279] Using the selected bins having high correlation values, the process obtains a linear model relating fetal fraction to coverage. See block 812. Each selected bin provides an independent variable for the linear model. Therefore, the linear model obtained also includes a parameter or weight for each bin. The weights of the bins are adjusted to fit the model to the data. After obtaining the linear model, process 800 involves applying the coverage data of the test sample to the model to determine the fetal fraction for the test sample. See block 814. The coverage data applied from the test sample are for the bins that have high correlations between fetal fraction and coverage.
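Blocks 808 to 814 can be sketched with NumPy as follows. This is an illustration under stated assumptions: the function names are hypothetical, bins are ranked by the absolute value of the Pearson correlation, the model includes an intercept, and ordinary least squares is used for the fit. With thousands of selected bins and fewer training samples, a regularized fit would likely be needed in practice; that is outside this sketch.

```python
import numpy as np


def fit_ff_model(coverage, fetal_fraction, n_top):
    """Select the bins most correlated with fetal fraction and fit a linear model.

    coverage       : array (n_samples, n_bins) of training-sample bin coverage.
    fetal_fraction : array (n_samples,) of known training fetal fractions.
    n_top          : number of bins to keep (6000 in one described implementation).
    Returns (selected_bin_indices, weights), where weights[0] is the intercept.
    """
    cov_c = coverage - coverage.mean(axis=0)
    ff_c = fetal_fraction - fetal_fraction.mean()
    # Pearson correlation between each bin's coverage and the fetal fraction.
    denom = np.sqrt((cov_c ** 2).sum(axis=0) * (ff_c ** 2).sum())
    corr = (cov_c * ff_c[:, None]).sum(axis=0) / denom
    selected = np.argsort(-np.abs(corr))[:n_top]
    design = np.column_stack([np.ones(len(fetal_fraction)), coverage[:, selected]])
    weights, *_ = np.linalg.lstsq(design, fetal_fraction, rcond=None)
    return selected, weights


def predict_ff(model, test_coverage):
    """Apply the fitted model to the bin coverage of one test sample."""
    selected, weights = model
    return weights[0] + test_coverage[selected] @ weights[1:]
```

Only the test sample's coverage in the selected bins enters the prediction, matching the statement that the applied data are for bins with high correlation.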
[00280] Figure 2J shows workflow 1100 for processing sequence reading information, which can be used to obtain fetal fraction estimates. Workflow 1100 shares similar processing steps with workflow 600 in Figure 2D. Blocks 1102, 1104, 1106, 1110, 1112, 1123, 1114, 1116, 1118 and 1122 correspond respectively to blocks 602, 604, 606, 610, 612, 623, 614, 616, 618 and 622. In some implementations, one or more normalization operations in block 1123 are optional. Pass 1 provides coverage information, which can be used in block 806 of process 800 shown in Figure 2G. Process 800 can then produce a fetal fraction estimate 1150 in Figure 2J.
[00281] In some implementations, a plurality of fetal fraction estimates (for example, 1150 and 1152 in Figure 2J) can be combined to provide a composite fetal fraction estimate (for example, 1154). Several methods can be used to obtain fetal fraction estimates. For example, the fetal fraction can be obtained from coverage information. See block 1150 of Figure 2J and process 800 of Figure 2G. In some implementations, the fetal fraction can also be estimated from the fragment size distribution. See block 1152 of Figure 2J and process 900 of Figure 2H. In some implementations, the fetal fraction can also be estimated from the 8-mer frequency distribution. See block 1152 of Figure 2J and process 1000 of Figure 2I.
[00282] In a test sample including cfDNA of a male fetus, the fetal fraction can also be estimated from the coverage of the Y chromosome and/or the X chromosome. In some implementations, a composite estimate of the fetal fraction (see, for example, block 1155) for a putatively male fetus is obtained using information selected from the group consisting of: a fetal fraction obtained from bin coverage information, a fetal fraction obtained from fragment size information, a fetal fraction obtained from the coverage of the Y chromosome, a fetal fraction obtained from the X chromosome and any combination thereof. In some implementations, the putative sex of the fetus is obtained using the coverage of the Y chromosome. Two or more fetal fractions (for example, 1150 and 1152) can be combined in several ways to provide a composite estimate of the fetal fraction (for example, 1155). For example, an average or weighted average method can be used in some implementations, where the weighting can be based on the statistical confidence of each fetal fraction estimate.
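One way to weight estimates by statistical confidence is inverse-variance weighting, sketched below. The choice of inverse-variance weights, the function name and the example standard errors are assumptions of this illustration; the description does not specify a particular weighting scheme.

```python
def combine_fetal_fractions(estimates, standard_errors):
    """Weighted average of several fetal fraction estimates.

    Each estimate is weighted by 1 / standard_error**2, so that more
    confident estimates contribute more. This weighting scheme is an
    assumption of the sketch, not a scheme stated in the description.
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    weighted_sum = sum(w * e for w, e in zip(weights, estimates))
    return weighted_sum / sum(weights)
```

With equal standard errors this reduces to the plain average; an estimate with twice the standard error receives one quarter of the weight.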
[00283] In some implementations, a composite estimate of the fetal fraction for a putatively female fetus is obtained using information selected from the group consisting of: a fetal fraction obtained from bin coverage information, a fetal fraction obtained from fragment size information and any combination thereof.
[00284] Figure 2H shows a process for determining the fetal fraction from the size distribution information according to some implementations. Process 900 begins by obtaining coverage information (for example, sequence dose values) for male training samples from a training set. See block 902. Process 900 then involves calculating the fetal fractions of the training samples using the methods described above with reference to block 804. See block 904.
[00285] Process 900 proceeds by dividing a size range into a plurality of bins to provide fragment-size-based bins, and determining the frequencies of readings for the fragment-size-based bins. See block 906. In some implementations, the frequencies of the fragment-size-based bins are obtained without normalization for the factors shown in block 1123. See path 1124 in Figure 2J. In some implementations, the frequencies of the fragment-size-based bins are obtained after optionally undergoing the normalization operations shown in block 1123 of Figure 2J. In some implementations, the size range is divided into 40 bins. In some implementations, the bin at the low end includes fragments of sizes smaller than about 55 base pairs. In some implementations, the bin at the low end includes fragments of sizes in the range of about 50 to 55 base pairs, which excludes information for readings shorter than 50 base pairs. In some implementations, the bin at the high end includes fragments larger than about 245 base pairs. In some implementations, the bin at the high end includes fragments of sizes in the range of about 245 to 250 base pairs, which excludes information for readings longer than 250 base pairs.
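The size binning can be sketched as below for the variant in which 40 bins of 5 base pairs span 50 to 250 base pairs and fragments outside that range are excluded. The function name, the half-open interval convention (lower bound inclusive, upper bound exclusive) and the normalization to relative frequencies are assumptions of this example.

```python
def size_bin_frequencies(fragment_lengths, low=50, high=250, n_bins=40):
    """Relative frequency of fragments in equal-width size bins.

    Bins cover [low, high) base pairs; with the defaults each bin is 5 bp
    wide. Fragments outside the range are ignored, corresponding to the
    variant that excludes readings shorter than 50 bp or longer than 250 bp.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for length in fragment_lengths:
        if low <= length < high:
            counts[int((length - low) // width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins
```

The resulting vector of 40 frequencies per sample is the kind of input that a linear fetal fraction model over size bins would consume.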
[00286] Process 900 proceeds to obtain a linear model relating fetal fraction to the frequencies of readings for the fragment-size-based bins, using the data from the training samples. See block 908. The linear model obtained includes independent variables for the frequencies of readings in the size-based bins. The model also includes a parameter or weight for each size-based bin. The weights of the bins are adjusted to fit the model to the data. After obtaining the linear model, process 900 involves applying the reading frequency data of the test sample to the model to determine the fetal fraction for the test sample. See block 910.
[00287] In some implementations, the frequencies of 8-mers can be used to calculate the fetal fraction. Figure 2I shows an exemplary process 1000 for determining the fetal fraction from 8-mer frequency information according to some implementations of the description. Process 1000 begins by obtaining coverage information (for example, sequence dose values) for male training samples in a training set. See block 1002. Process 1000 then involves calculating the fetal fractions of the training samples using any of the methods described for block 804. See block 1004.
[00288] Process 1000 also involves obtaining the frequencies of 8-mers (for example, all possible arrangements of the 4 nucleotides at 8 positions) from the readings of each training sample. See block 1006. In some implementations, up to 65,536 8-mers, or close to that many, and their frequencies are obtained. In some implementations, the frequencies of 8-mers are obtained without normalization for the factors shown in block 1123. See path 1124 in Figure 2J. In some implementations, the frequencies of 8-mers are obtained after optionally undergoing the normalization operations shown in block 1123 of Figure 2J.
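Counting 8-mer frequencies can be sketched as follows. The description does not state how 8-mers are extracted from the readings; this example assumes every overlapping 8-mer within each reading is counted and that 8-mers containing characters other than A, C, G and T are skipped. The names are hypothetical.

```python
from collections import Counter
from itertools import product

# All 4**8 = 65,536 possible 8-mers over the four nucleotides.
ALL_8MERS = ["".join(p) for p in product("ACGT", repeat=8)]


def kmer_frequencies(reads, k=8):
    """Relative frequency of each k-mer observed in a collection of reads.

    Counts every overlapping k-mer in every read (an assumption of this
    sketch) and ignores k-mers containing ambiguous bases such as N.
    """
    valid = set("ACGT")
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

A fixed-length feature vector for modeling can then be built by looking up each entry of `ALL_8MERS` in the returned dictionary, using zero for 8-mers that were not observed.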
[00289] Each 8-mer is associated with the frequencies of the samples in the training set. Therefore, for each 8-mer, a correlation can be obtained between the 8-mer frequencies of the samples and the fetal fractions of the samples. Process 1000 involves obtaining the correlations between fetal fraction and 8-mer frequency for all 8-mers. See block 1008. The process then selects the 8-mers having correlation values above a threshold. See block 1010. The purpose is to identify the 8-mers that show a high correlation between 8-mer frequency and fetal fraction in the training samples. These 8-mers can then be used to predict the fetal fraction in the test sample. Although the training samples are male samples, the correlation between fetal fraction and 8-mer frequency can be generalized to male and female test samples.
[00290] Using the selected 8-mers having high correlation values, the process obtains a linear model relating fetal fraction to 8-mer frequency. See block 1012. Each selected 8-mer provides an independent variable for the linear model. Therefore, the linear model obtained also includes a parameter or weight for each 8-mer. After obtaining the linear model, process 1000 involves applying the 8-mer frequency data of the test sample to the model to determine the fetal fraction for the test sample. See block 1014.
[00291] Returning to Figure 2E, in some implementations, process 700 involves obtaining a final ploidy probability in operation 736 using the t-statistic based on the coverage of total fragments provided by operation 726, the fetal fraction estimate provided by operation 726 and the t-statistic based on the coverage of short fragments provided by operation 730. These implementations combine the results of pass 1 and pass 2 using multivariate normal models. In some implementations for assessing CNV, the ploidy probability is an aneuploidy probability, which is the probability of a model having an aneuploid assumption (for example, trisomy or monosomy) minus the probability of a model having a euploid assumption, in which the model uses the t-statistic based on coverage of total fragments, the fetal fraction estimate and the t-statistic based on coverage of short fragments as inputs and provides a probability as an output.
[00292] In some implementations, the ploidy probability is expressed as a probability ratio. In some implementations, the probability ratio is modeled as:

$$ LR = \frac{\sum_{ff_{total}} q(ff_{total}) \, p_1(T_{short}, T_{total} \mid ff_{total})}{p_0(T_{short}, T_{total})} $$

where $p_1$ represents the probability that the data originate from a multivariate normal distribution representing a 3-copy or 1-copy model, $p_0$ represents the probability that the data originate from a multivariate normal distribution representing a 2-copy model, $T_{short}$ and $T_{total}$ are t-scores calculated from the chromosomal coverage generated from short fragments and total fragments, and $q(ff_{total})$ is the density distribution of the fetal fraction (estimated from the training data) considering the error associated with the fetal fraction estimate. The model combines the coverage generated from short fragments with the coverage generated from total fragments, which helps improve the separation between the coverage scores of affected and unaffected samples. In the described embodiment, the model also makes use of the fetal fraction, and thus further improves the ability to discriminate between affected and unaffected samples. Here, the probability ratio is calculated using the t-statistic based on the coverage of total fragments (726), the t-statistic based on the coverage of short fragments (730) and a fetal fraction estimate provided by processes 800 (or block 726), 900 or 1000 as described above. In some implementations, this probability ratio is used to analyze chromosomes 13, 18 and 21.
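A probability ratio of this kind can be sketched in plain Python as follows. Every modeling detail here is an assumption made for illustration: the two t-scores are given unit variance and a fixed correlation `rho`, the aneuploid model's expected t-scores are taken to scale linearly with fetal fraction through hypothetical slopes, the euploid model has zero mean, and the fetal fraction density q is a discretized normal centered on the estimate.

```python
import math


def bivariate_normal_pdf(x, y, mean_x, mean_y, rho):
    """Density of a bivariate normal with unit variances and correlation rho."""
    dx, dy = x - mean_x, y - mean_y
    z = (dx * dx - 2.0 * rho * dx * dy + dy * dy) / (1.0 - rho * rho)
    return math.exp(-0.5 * z) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))


def probability_ratio(t_short, t_total, ff_est, ff_sd,
                      slope_short, slope_total, rho=0.5, n_grid=41):
    """Ratio of aneuploid-model to euploid-model probability for two t-scores.

    The numerator averages the aneuploid density over a grid of fetal
    fractions weighted by q(ff), a normal centered on ff_est with standard
    deviation ff_sd. slope_short and slope_total are hypothetical constants
    mapping fetal fraction to the expected t-scores under aneuploidy.
    """
    grid = [ff_est + ff_sd * (-3.0 + 6.0 * i / (n_grid - 1)) for i in range(n_grid)]
    grid = [ff for ff in grid if ff > 0.0]
    q = [math.exp(-0.5 * ((ff - ff_est) / ff_sd) ** 2) for ff in grid]
    q_sum = sum(q)
    numerator = sum(
        (qi / q_sum)
        * bivariate_normal_pdf(t_short, t_total, slope_short * ff, slope_total * ff, rho)
        for qi, ff in zip(q, grid)
    )
    return numerator / bivariate_normal_pdf(t_short, t_total, 0.0, 0.0, rho)
```

Under these assumptions, t-scores near the values expected for the estimated fetal fraction give a ratio well above 1, while t-scores near zero give a ratio below 1.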
[00293] In some implementations, a ploidy probability obtained by operation 736 uses only the t-statistic based on the relative frequency of short fragments provided by operation 734 of pass 3 and the fetal fraction estimate provided by operation 726 or by processes 800, 900 or 1000. The probability ratio can be calculated according to the following equation:

$$ LR = \frac{\sum_{ff_{total}} q(ff_{total}) \, p_1(T_{short\text{-}freq} \mid ff_{total})}{p_0(T_{short\text{-}freq})} $$

where $p_1$ represents the probability that the data originate from a multivariate normal distribution representing a 3-copy or 1-copy model, $p_0$ represents the probability that the data originate from a multivariate normal distribution representing a 2-copy model, $T_{short\text{-}freq}$ is a t-score calculated from the relative frequency of short fragments, and $q(ff_{total})$ is the density distribution of the fetal fraction (estimated from the training data) considering the error associated with the fetal fraction estimate. Here, the probability ratio is calculated using the t-statistic based on the relative frequency of short fragments (734) and a fetal fraction estimate provided by processes 800 (or block 726), 900 or 1000 as described above. In some implementations, this probability ratio is used to analyze the X chromosome.
[00294] In some implementations, the probability ratio is calculated using the t-statistic based on the coverage of total fragments (726), the t-statistic based on the coverage of short fragments (730) and the t-statistic based on the relative frequency of short fragments (734). In addition, the fetal fraction obtained as described above can be combined with the t-statistics to calculate the probability ratio. By combining information from any of the three passes 713A, 713B and 713C, the discriminative ability of the ploidy assessment can be improved. See, for example, Example 2 and Figure 12. In some implementations, different combinations can be used to obtain probability ratios for a chromosome, for example, the t-statistics from all three passes, the t-statistics from the first and second passes, the fetal fraction and three t-statistics, the fetal fraction and one t-statistic, etc. An optimal combination can then be selected based on the performance of the models.
[00295] In some implementations to evaluate autosomes, the modeled probability ratio represents the probability that the modeled data was obtained from a trisomy or monosomy sample with respect to the probability of the modeled data having been obtained from a diploid sample. Such a probability ratio can be used to determine trisomy or monosomy of autosomes in some implementations.
[00296] In some implementations for evaluating the sex chromosomes, the probability ratio for monosomy X and the probability ratio for trisomy X are evaluated. In addition, a coverage measure (for example, an NCV or a coverage z-score) for the X chromosome and one for the Y chromosome are also evaluated. In some implementations, the four values are evaluated using a decision tree to determine the number of copies of the sex chromosomes. In some implementations, the decision tree allows the determination of a ploidy case of XX, XY, X, XXY, XXX or XYY.
[00297] In some implementations, the probability ratio is transformed into a log probability ratio, and a criterion or threshold for calling an aneuploidy or a variation in the number of copies can be set empirically to obtain a particular sensitivity and selectivity. For example, a log probability ratio of 1.5 can be set as the criterion for calling a trisomy 13 or a trisomy 18, based on a model's sensitivity and selectivity when applied to a training set. Likewise, a call criterion value of 3 can be set for a trisomy of chromosome 21 in some applications.
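The thresholding just described can be sketched as follows. The example values 1.5 and 3 come from the paragraph above; the use of the natural logarithm, the strict "greater than" comparison, and the names `CALL_THRESHOLDS` and `call_trisomy` are assumptions made for illustration.

```python
import math

# Example criteria from the description; in practice these are set empirically
# on a training set to reach a chosen sensitivity and selectivity.
CALL_THRESHOLDS = {"chr13": 1.5, "chr18": 1.5, "chr21": 3.0}


def call_trisomy(chromosome, probability_ratio, thresholds=CALL_THRESHOLDS):
    """Return True when the log probability ratio exceeds the call criterion.

    ASSUMPTION: natural logarithm; the description does not state the base.
    """
    log_ratio = math.log(probability_ratio)
    return log_ratio > thresholds[chromosome]
```

With these example thresholds, the same log probability ratio of 2.0 would produce a call for chromosome 13 but not for chromosome 21.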
Details of an exemplary process for determining sequence coverage
[00298] Figure 3A shows an example of a process 301 for reducing noise in the sequence data of a test sample. Figures 3B-3J show data analysis at various stages of the process. This provides an example of a process flow that can be used in a multi-pass process as described in Figure 2D.
[00299] Process 301 illustrated in Figure 3A uses sequence label coverage, based on the number of sequence labels, to evaluate the number of copies. However, similar to the description above with respect to process 100 for determining CNV with reference to Figure 1, other variables or parameters, such as size, size ratio and methylation level, can be used instead of coverage for process 400. In some implementations, two or more variables may separately undergo the same process to derive two scores indicative of CNV probability, as shown above with reference to Figure 2D. The two scores can then be combined to determine a CNV. In addition, coverage and other parameters can be weighted based on the size of the fragments from which the labels are derived. For ease of reading, only coverage is referred to in process 300, but it should be noted that other parameters, such as size, size ratio, methylation level, size-weighted count, etc., can be used in place of coverage.
[00300] As shown in Figure 3A, the described process begins with the extraction of cfDNA from one or more samples. See block 303. Suitable extraction processes and devices are described elsewhere herein. In some embodiments, a process described in US Patent Application No. 61/801,126, filed on March 15, 2013 (incorporated herein by reference in its entirety) extracts the cfDNA. In some implementations, the device processes the cfDNA of multiple samples together to provide multiplexed libraries and sequence data. See blocks 305 and 307 in Figure 3A. In some embodiments, the device processes the cfDNA from eight or more test samples in parallel. As described elsewhere herein, a sequencing system can process the extracted cfDNA to produce a library of coded (e.g., barcoded) cfDNA fragments. A sequencer sequences the cfDNA library to produce a very large number of sequence readings. Per-sample coding allows the readings in multiplexed samples to be demultiplexed. Each of the eight or more samples can have hundreds of thousands or millions of readings. The process can filter readings before the additional operations in Figure 3A. In some embodiments, reading filtering is a quality-filtering process enabled by software programs implemented in the sequencer to filter out erroneous and low-quality readings. For example, Illumina's Sequencing Control Software (SCS) and Consensus Assessment of Sequence and Variation (CASAVA) software filter out erroneous and low-quality readings by converting raw image data generated by the sequencing reactions into intensity scores, base calls, quality-scored alignments and additional formats to provide biologically relevant information for downstream analysis.
[00301] After the sequencer or other device generates the readings for a sample, an element of the computer system aligns the readings to a reference genome. See block 309. Alignment is described elsewhere herein. The alignment produces labels, which contain reading sequences with annotated location information specifying unique positions in the reference genome. In some implementations, the system conducts a first-pass alignment without regard to duplicate readings (two or more readings having identical sequences) and subsequently removes duplicate readings, or counts duplicate readings as a single reading, to produce non-duplicated sequence labels. In other implementations, the system does not remove duplicate readings. In some embodiments, the process removes from consideration readings that are aligned to multiple sites in the genome, to produce uniquely aligned labels. In some embodiments, non-redundant, uniquely aligned sequence labels mapped to non-excluded sites (NESs) are counted to produce non-excluded site counts (NES counts), which provide the data for estimating coverage.
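The filtering that leads to NES counts can be sketched as below. The input layout is hypothetical: each reading is a pair of its sequence and a list of `(chromosome, position)` alignment hits, and excluded sites are given as half-open `(start, end)` intervals per chromosome. A real pipeline would read these from alignment files rather than Python lists.

```python
def non_excluded_labels(readings, excluded_regions):
    """Keep non-duplicated, uniquely aligned labels that fall on non-excluded sites.

    readings: iterable of (sequence, [(chromosome, position), ...]) pairs.
    excluded_regions: dict mapping chromosome -> list of (start, end) intervals.
    """
    seen_sequences = set()
    labels = []
    for sequence, hits in readings:
        if len(hits) != 1:
            continue  # unaligned, or aligned to multiple sites in the genome
        if sequence in seen_sequences:
            continue  # duplicate reading: identical sequence already counted
        seen_sequences.add(sequence)
        chromosome, position = hits[0]
        in_excluded_site = any(
            start <= position < end
            for start, end in excluded_regions.get(chromosome, [])
        )
        if not in_excluded_site:
            labels.append((chromosome, position))
    return labels
```

Counting the returned labels per bin then gives the NES counts used to estimate coverage.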
[00302] As explained elsewhere, excluded sites are sites found in regions of a reference genome that have been excluded for the purpose of counting sequence labels. In some embodiments, excluded sites are found in regions of chromosomes that contain repetitive sequences, for example, centromeres and telomeres and regions of chromosomes that are common to more than one chromosome, for example, regions present on the Y chromosome that are also present on the X chromosome. Non-excluded sites (NESs) are sites that are not excluded in a reference genome for the purpose of counting sequence labels.
[00303] Next, the system divides the aligned labels into bins on the reference genome. See block 311. The bins are spaced along the length of the reference genome. In some embodiments, the entire reference genome is divided into contiguous bins, which can have a defined equal size (for example, 100 kb). Alternatively, the bins can have a dynamically determined length, possibly on a per-sample basis. The sequencing depth informs the selection of an optimal bin size. Dynamically sized bins can be sized according to the size of the library. For example, the bin size can be set to the length of sequence needed to accommodate 1000 labels, on average.
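Both binning options can be illustrated with a short sketch. The function names are hypothetical, positions are assumed to be zero-based, and the dynamic size is computed for the genome as a whole rather than per region.

```python
from collections import Counter


def count_labels_per_bin(labels, bin_size=100_000):
    """Count labels in contiguous, equal-sized bins along each chromosome."""
    return Counter((chromosome, position // bin_size) for chromosome, position in labels)


def dynamic_bin_size(genome_length, n_labels, labels_per_bin=1000):
    """Bin length that holds `labels_per_bin` labels on average for this library."""
    return max(1, round(genome_length * labels_per_bin / n_labels))
```

For instance, a library of 30 million labels over a 3 Gb genome would give 100 kb bins under this rule, while a shallower library would give proportionally larger bins.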
[00304] Each bin contains a number of labels from the sample under consideration. This number of labels, which reflects the “coverage” of the aligned sequence, serves as a starting point for filtering and otherwise cleaning the sample data to reliably determine the variation in the number of copies in the sample. Figure 3A shows the cleaning operations in blocks 313 to 321.
[00305] In the embodiment depicted in Figure 3A, the process applies a mask to the bins of the reference genome. See block 313. The system can exclude coverage in masked bins from consideration in some or all of the following process operations. In many cases, the coverage values of the masked bins are not considered in any of the remaining operations in Figure 3A.
[00306] In various implementations, one or more masks are applied to remove bins for regions of the genome found to exhibit high variability from sample to sample. Such masks are provided both for the chromosomes of interest (for example, chr13, 18 and 21) and for the other chromosomes. As explained elsewhere, a chromosome of interest is the chromosome under consideration as potentially harboring a variation in the number of copies or another aberration.
[00307] In some implementations, masks are identified from a training set of qualified samples using the following method. Initially, each training set sample is processed and filtered according to operations 315 to 319 in Figure 3A. The normalized and corrected coverage amounts are then noted for each bin, and statistics such as standard deviation, median absolute deviation, and/or coefficient of variation are calculated for each bin. Various filter combinations can be evaluated for each chromosome of interest. The filter combinations provide one filter for the bins of the chromosome of interest and a different filter for the bins of all other chromosomes.
[00308] In some implementations, the choice of a normalization chromosome (or group of chromosomes) is reconsidered after obtaining the masks (for example, by choosing the cutoffs for a chromosome of interest as described above). After applying the sequence mask, the process of choosing a normalization chromosome or chromosomes can be conducted as described elsewhere herein. For example, all possible combinations of chromosomes are evaluated as normalization chromosomes and ranked according to their ability to discriminate between affected and unaffected samples. This process may or may not find a different optimal normalization chromosome or group of chromosomes. In other embodiments, normalization chromosomes are those that result in the least variability in the sequence dose for the sequence of interest across all qualified samples. If a different normalization chromosome or group of chromosomes is identified, the process optionally repeats the above-described identification of bins to filter. The new normalization chromosomes may result in different cutoffs.
[00309] In some embodiments, a different mask is applied to the Y chromosome. An example of a suitable Y chromosome mask is described in US Provisional Patent Application No. 61/836,057, filed on June 17, 2013 [attorney docket No. ARTEP00811], which is incorporated by reference for all purposes.
[00310] After the system computationally masks the bins, it computationally normalizes the coverage values in the bins that are not excluded by the masks. See block 315. In some embodiments, the system normalizes the test sample coverage values in each bin (for example, NES counts per bin) against most or all of the coverage in the reference genome or a portion of it (for example, the coverage on the robust chromosomes of the reference genome). In some cases, the system normalizes the test sample coverage values (per bin) by dividing the count for the bin under consideration by the total number of all non-excluded sites aligned to all robust chromosomes in the reference genome. In some embodiments, the system normalizes the test sample coverage values (per bin) by performing a linear regression. For example, the system first calculates the coverage for a subset of bins on the robust chromosomes as y_a = intercept + slope * gwp_a, where y_a is the coverage for bin a and gwp_a is the global profile for the same bin. The system then calculates the normalized coverages z_b as: z_b = y_b / (intercept + slope * gwp_b) - 1.
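The division-based normalization can be sketched as follows, assuming per-bin counts keyed by `(chromosome, bin_index)` and chromosome names of the form `chr1` to `chr22` (both hypothetical conventions). The robust set follows the embodiment in which robust chromosomes are all autosomes other than 13, 18 and 21.

```python
# All autosomes other than chromosomes 13, 18 and 21.
ROBUST_AUTOSOMES = {f"chr{n}" for n in range(1, 23)} - {"chr13", "chr18", "chr21"}


def normalize_coverage(bin_counts, robust=ROBUST_AUTOSOMES):
    """Divide each bin count by the total count on the robust chromosomes.

    bin_counts: dict mapping (chromosome, bin_index) -> count for unmasked bins.
    """
    total = sum(
        count for (chromosome, _), count in bin_counts.items() if chromosome in robust
    )
    return {bin_key: count / total for bin_key, count in bin_counts.items()}
```

Because the denominator comes only from the sample itself, two samples sequenced to different depths end up on the same scale.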
[00311] As explained above, a robust chromosome is one that is unlikely to be aneuploid. In some embodiments, the robust chromosomes are all autosomal chromosomes other than chromosomes 13, 18 and 21. In some embodiments, the robust chromosomes are all autosomal chromosomes other than chromosomes determined to deviate from a normal diploid genome.
[00312] The transformed count value or coverage of a bin is referred to as a "normalized coverage amount" for further processing. Normalization is performed using information unique to each sample. Typically, no information from a training set is used. Normalization allows coverage amounts from samples having different library sizes (and therefore different numbers of readings and labels) to be treated on an equal footing. Some of the subsequent process operations use coverage amounts derived from training samples, which may be sequenced from libraries that are larger or smaller than the library used for the test sample under consideration. Without normalization based on the number of readings aligned to the entire reference genome (or at least to the robust chromosomes), treatment using parameters derived from a training set may not be reliable or generalizable in some implementations.
[00313] Figure 3B illustrates coverage through chromosomes 21, 13 and 18 for many samples. Some of the samples were processed differently from each other. As a consequence, a wide sample-to-sample variation can be seen in any given genomic position. Normalization removes some of the sample-to-sample variation. The left panel of Figure 3C describes the normalized coverage amounts across an entire genome.
[00314] In the embodiment of Figure 3A, the system removes or reduces a “global profile” from the normalized coverage quantities produced in operation 315. See block 317. This operation removes systematic biases in the normalized coverage quantities that arise from the genome structure, the library generation process and the sequencing process. In addition, this operation is designed to correct any systematic linear deviation from the expected profile in any given sample.
[00315] In some implementations, removing the global profile involves dividing the normalized coverage amount of each bin by a corresponding expected value for that bin. In other embodiments, removing the global profile involves subtracting an expected value for each bin from the normalized coverage amount of that bin. The expected value can be obtained from a training set of unaffected samples (or unaffected female samples for the X chromosome). Unaffected samples are samples from individuals known not to have an aneuploidy for the chromosome of interest. In some embodiments, the process uses average values of normalized coverage amounts for each bin as determined using the training set. In other words, the average values are the expected values.
[00316] In some embodiments, the removal of the global profile is implemented using a linear correction for the dependence of the sample coverage on the global profile. As indicated, the global profile is an expected value for each bin as determined from the training set (for example, the average value for each bin). These embodiments can use a robust linear model obtained by fitting the test sample's normalized coverage amounts against the average global profile obtained for each bin. In some embodiments, the linear model is obtained by regressing the sample's observed normalized coverage amounts against the average global profile (or other expected value).
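The division and subtraction variants can be sketched together. The per-bin expected value is taken here as the arithmetic mean across training samples, which is an assumption; a median could be substituted. The function names and the list-of-lists layout are hypothetical.

```python
from statistics import mean


def global_profile(training_coverage):
    """Expected value per bin: here, the mean across unaffected training samples.

    training_coverage: list of per-sample lists, all in the same bin order.
    """
    return [mean(bin_values) for bin_values in zip(*training_coverage)]


def remove_global_profile(sample_coverage, profile, mode="subtract"):
    """Remove the expected per-bin value by subtraction or by division."""
    if mode == "subtract":
        return [observed - expected for observed, expected in zip(sample_coverage, profile)]
    if mode == "divide":
        return [observed / expected for observed, expected in zip(sample_coverage, profile)]
    raise ValueError(f"unknown mode: {mode!r}")
```

Subtraction leaves unaffected bins near zero, whereas division leaves them near one, so downstream thresholds depend on which variant is chosen.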
[00317] The linear model is based on the hypothesis that the sample coverage quantities have a linear relationship with the global profile values, and that this linear relationship holds both for the robust chromosomes/regions and for a sequence of interest. See Figure 3D. In such a case, a regression of the sample's normalized coverage quantities on the expected coverage quantities of the global profile will produce a line having a slope and an intercept. In some embodiments, the slope and intercept of such a line are used to calculate a “predicted” coverage amount from the global profile value for a bin. In some implementations, a global profile correction involves adjusting each bin's normalized coverage amount by the predicted coverage amount for the bin. In some implementations, the coverage of the test sequence labels is adjusted by: (i) obtaining a mathematical relationship between the coverage of the test sequence labels and the expected coverage in a plurality of bins in one or more robust chromosomes or regions and (ii) applying the mathematical relationship to the bins in the sequence of interest. In some implementations, the coverage in a test sample is corrected for variation using a linear relationship between the expected coverage values of the unaffected training samples and the coverage values for the test sample on robust chromosomes or other robust regions of the genome. The adjustment results in global-profile-corrected coverage. In some cases, the adjustment involves obtaining the coverage of a test sample for a subset of bins on robust chromosomes or regions as follows: y_a = intercept + slope * gwp_a, where y_a is the coverage of bin a for the test sample in one or more robust chromosomes or regions and gwp_a is the global profile for bin a for unaffected training samples. The process then computes a global-profile-corrected coverage z_b for a sequence or region of interest as: z_b = y_b / (intercept + slope * gwp_b) - 1, where y_b is the observed coverage of bin b for the test sample in the sequence of interest (which may reside outside a robust chromosome or region) and gwp_b is the global profile for bin b for unaffected training samples. The denominator (intercept + slope * gwp_b) is the coverage for bin b that is predicted to be observed in an unaffected test sample, based on the relationship estimated from the robust regions of the genome. In the case of a sequence of interest harboring a variation in the number of copies, the observed coverage, and consequently the global-profile-corrected coverage value for bin b, will deviate significantly from the coverage of an unaffected sample. For example, the corrected coverage z_b would be proportional to the fetal fraction in the case of a trisomic sample for bins on the affected chromosome. This process normalizes within the sample by computing the intercept and slope on the robust chromosomes and then assesses how the genomic region of interest deviates from the relationship (as described by the slope and the intercept) that holds for the robust chromosomes within the same sample.
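The two-step computation above, a fit on robust bins followed by z_b = y_b / (intercept + slope * gwp_b) - 1 on the bins of interest, can be sketched as below. Ordinary least squares is used here as a stand-in for the robust linear model mentioned in the description, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def global_profile_corrected(y_interest, gwp_interest, y_robust, gwp_robust):
    """Fit on robust bins, then apply z_b = y_b / (intercept + slope * gwp_b) - 1."""
    intercept, slope = fit_line(gwp_robust, y_robust)
    return [
        y_b / (intercept + slope * gwp_b) - 1.0
        for y_b, gwp_b in zip(y_interest, gwp_interest)
    ]
```

A bin whose observed coverage is 5% above the coverage predicted from the robust chromosomes yields z_b of about 0.05, which is the kind of excess expected to scale with fetal fraction on a trisomic chromosome.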
[00318] The slope and intercept are obtained from a line as shown in Figure 3D. An example of removing the global profile is depicted in Figure 3C. The left panel shows a high bin-to-bin variation in normalized coverage amounts across many samples. The right panel shows the same normalized coverage amounts after removing the global profile as described above.
[00319] After the system removes or reduces global profile variations in block 317, it corrects for variations in the GC (guanine-cytosine) content in the sample. See block 319. Each bin has its own GC fraction. The fraction is determined by dividing the number of G and C nucleotides in a bin by the total number of nucleotides in the bin (for example, 100,000). Some bins will have higher GC fractions than others. As shown in Figures 3E and 3F, different samples show different GC trends. These differences and their corrections are further explained below. Figures 3E-G show the global-profile-corrected, normalized coverage amount (per bin) as a function of GC fraction (per bin). Surprisingly, different samples have different GC dependencies. Some samples show a monotonically decreasing dependence (as in Figure 3E), while others show a comma-shaped dependence (as in Figures 3F and 3G). Because these profiles can be unique to each sample, the correction described in this step is performed separately and uniquely for each sample.
[00320] In some embodiments, the system computationally arranges the bins on the basis of GC fraction, as shown in Figures 3E-G. It then corrects the global-profile-corrected, normalized coverage amount of a bin using information from other bins with similar GC content. This correction is applied to each unmasked bin.
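The per-bin GC fraction is a simple ratio, sketched below for a bin given as a string of reference bases. Counting every character in the denominator, including any ambiguous bases such as N, is an assumption made for this illustration.

```python
def gc_fraction(bin_sequence):
    """Number of G and C nucleotides divided by the total nucleotides in the bin."""
    sequence = bin_sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

Sorting bins by this value gives the GC ordering on which the corrections described next operate.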
[00321] In some processes, each bin is corrected for GC content as follows. The system computationally selects bins having GC fractions similar to that of the bin under consideration and then determines a correction parameter from the information in the selected bins. In some embodiments, the bins having similar GC fractions are selected using an arbitrarily defined similarity cutoff value. In one example, 2% of all bins are selected. These are the 2% of bins whose GC content is most similar to that of the bin under consideration. For example, the 1% of bins having slightly more GC content and the 1% having slightly less GC content are selected.
[00322] Using the selected bins, the system computationally determines a correction parameter. In one example, the correction parameter is a value representative of the normalized coverage amounts (after removing the global profile) of the selected bins. Examples of such representative values include the median or mean of the normalized coverage amounts of the selected bins. The system applies the correction parameter calculated for a bin under consideration to the normalized coverage amount (after removing the global profile) of that bin. In some implementations, a representative value (for example, the average value) is subtracted from the normalized coverage amount of the bin under consideration. In some embodiments, the average value (or other representative value) of the normalized coverage amounts is determined using only the coverage amounts for the robust autosomal chromosomes (all autosomes other than chromosomes 13, 18 and 21).
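A minimal sketch of this neighbor-based correction follows. It assumes the representative value is the median, that similarity is judged by rank in the GC ordering, and that the bin under consideration is excluded from its own neighborhood. The function name and the `fraction` parameter are hypothetical, and the restriction to robust autosomes is omitted for brevity.

```python
from statistics import median


def neighbour_gc_correct(gc, coverage, fraction=0.02):
    """Subtract from each bin the median coverage of its nearest GC neighbours.

    Half of the selected neighbours have slightly less GC content and half
    slightly more, as far as the ends of the GC ordering allow.
    """
    n = len(gc)
    order = sorted(range(n), key=lambda i: gc[i])  # bin indices in GC order
    half = max(1, round(n * fraction / 2))
    corrected = [0.0] * n
    for rank, bin_index in enumerate(order):
        low = max(0, rank - half)
        high = min(n, rank + half + 1)
        neighbours = [coverage[order[r]] for r in range(low, high) if r != rank]
        corrected[bin_index] = coverage[bin_index] - median(neighbours)
    return corrected
```

With `fraction=0.02` and a genome of roughly 30,000 bins, each bin would be compared against about 600 neighbors.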
[00323] In an example using, for example, 100 kb bins, each bin will have a unique GC fraction value, and the bins are divided into groups based on their GC fraction. For example, the bins are divided into 50 groups, where the group limits correspond to the (0, 2, 4, 6, ..., 100) percent quantiles of the GC distribution. An average normalized coverage amount is calculated for each group from the bins of the robust autosomes mapped to the same GC group (in the sample), and the average value is then subtracted from the normalized coverage quantities (for all bins across the entire genome in the same GC group). This applies a GC correction estimated from the robust chromosomes within any given sample to the potentially affected chromosomes within the same sample. For example, all the bins on the robust chromosomes having a GC content between 0.338660 and 0.344420 are grouped together, the average is calculated for this group and is subtracted from the normalized coverage of the bins within this GC range, which bins can be found anywhere in the genome (excluding chromosomes 13, 18, 21 and X). In some embodiments, the Y chromosome is excluded from this GC correction process.
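The grouped correction can be sketched as follows. Equal-count groups built from the GC rank stand in for the quantile limits, and the representative value defaults to the median; both are assumptions, and `statistics.mean` can be passed instead. Groups without any robust bin are left uncorrected, which is also a choice made for this illustration.

```python
from statistics import median


def gc_group_correct(gc, coverage, is_robust, n_groups=50, representative=median):
    """Subtract, per GC group, the representative coverage of the robust bins.

    gc, coverage, is_robust: parallel per-bin lists for one sample.
    """
    n = len(gc)
    order = sorted(range(n), key=lambda i: gc[i])
    group_of = [0] * n
    for rank, bin_index in enumerate(order):
        group_of[bin_index] = rank * n_groups // n  # equal-count quantile groups
    corrected = list(coverage)
    for group in range(n_groups):
        members = [i for i in range(n) if group_of[i] == group]
        robust_values = [coverage[i] for i in members if is_robust[i]]
        if not robust_values:
            continue  # no robust bins to estimate from; leave the group as is
        centre = representative(robust_values)
        for i in members:
            corrected[i] = coverage[i] - centre  # applied genome-wide in the group
    return corrected
```

The key design point is that the correction is estimated only from robust bins but applied to every bin in the same GC group, so a true copy number signal on an affected chromosome is not absorbed into the correction.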
[00324] Figure 3G shows the application of a GC correction using average normalized coverage amounts as a correction parameter, as just described. The left panel shows the uncorrected coverage amounts versus the GC fraction profile. As shown, the profile has a non-linear shape. The right panel shows the corrected coverage amounts. Figure 3H shows the normalized coverage for many samples before GC fraction correction (left panel) and after GC fraction correction (right panel). Figure 3I shows the coefficient of variation (CV) of the normalized coverage for many test samples before GC fraction correction (red) and after GC fraction correction (green), where the GC correction leads to substantially smaller variation in the normalized coverages.
[00325] The above process is a relatively simple implementation of GC correction. Alternative methods for correcting GC bias use a spline or other non-linear fitting technique, which can be applied in the continuous GC space and does not involve binning coverage quantities by GC content. Examples of suitable techniques include continuous lowess correction and smooth spline correction. A fitting function can be derived from the bin-by-bin normalized coverage amount versus GC content for the sample under consideration. The correction for each bin is calculated by applying the fitting function to the GC content of the bin under consideration. For example, the normalized coverage amount can be adjusted by subtracting the expected coverage value given by a spline fit at the GC content of the bin under consideration. Alternatively, the adjustment can be obtained by dividing by the expected coverage value according to the spline fit.
[00326] After correcting the GC dependency in operation 319, the computer system removes outlier bins in the sample under consideration. See block 321. This operation can be referred to as single-sample filtering or trimming. Figure 3J shows that even after the GC correction, the coverage still has sample-specific variation within small regions. See, for example, the coverage at position 1.1e8 on chromosome 12, where an unexpectedly high deviation from the expected value occurs. It is possible that this deviation results from a small variation in the number of copies in the genome material. Alternatively, it may be due to technical reasons in the sequencing unrelated to the variation in the number of copies. Typically, this operation is only applied to robust chromosomes.
[00327] As an example, the system computationally filters any bins having a GC-corrected normalized coverage amount of more than 3 median absolute deviations from the median of the GC-corrected normalized coverage across all bins on the chromosome housing the bin under consideration for filtering. In one example, the cutoff value is defined as 3 median absolute deviations adjusted to be consistent with the standard deviation, so that each such deviation is in fact 1.4826 * the median absolute deviation from the median. In some embodiments, this operation is applied to all chromosomes in the sample, including both robust chromosomes and chromosomes suspected of aneuploidy.
[00328] In certain implementations, an additional operation that can be characterized as quality control is performed. See block 323. In some embodiments, a quality control metric involves detecting whether any potential denominator chromosomes, that is, “normalizing chromosomes” or “robust chromosomes”, are aneuploid or otherwise unsuitable for determining whether the test sample has a variation in the number of copies in a sequence of interest. When the process determines that a robust chromosome is inappropriate, the process may disregard the test sample and make no call. Alternatively, a failure of this QC metric can trigger the use of an alternative set of normalization chromosomes for calling. In one example, a quality control method compares actual normalized coverage values for robust chromosomes against expected values for robust autosomal chromosomes. The expected values can be obtained by fitting a multivariate normal model to the normalized profiles of unaffected training samples, selecting the best model structure according to the likelihood of the data or Bayesian criteria (for example, the model is selected using the Akaike information criterion or possibly the Bayesian information criterion) and fixing the best model for use in QC. Normal models of the robust chromosomes can be obtained, for example, using a clustering technique that identifies a probability function having a mean and standard deviation for the chromosome coverages in normal samples. Of course, other model forms can be used. The process assesses the probability of the normalized coverage observed in any incoming test sample given the fixed model parameters. It can do this by scoring each incoming test sample with the model to obtain the probability and thereby identify outliers relative to the normal sample set. A deviation in the probability of the test sample from that of the training samples may suggest an abnormality in the normalization chromosomes or a sample handling/assay processing artifact that may result in incorrect sample classification. This QC metric can be used to reduce classification errors associated with either of these sample artifacts. Figure 3K, right panel, shows the chromosome number on the x-axis and, on the y-axis, the normalized chromosome coverage based on a comparison with a QC model obtained as described above.
The graphs show a sample with excessive coverage for chromosome 2 and another sample with excessive coverage for chromosome 20. These samples would be eliminated using the QC metric described here or diverted to use an alternative set of normalization chromosomes. The left panel of Figure 3K shows NCV versus probability for a chromosome.
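Returning to the single-sample filtering of operation 321, a minimal sketch of the cutoff based on median absolute deviation is given below. It assumes the cutoff is 3 scaled deviations, where one scaled deviation is 1.4826 times the median absolute deviation from the median, and that the input holds the GC-corrected normalized coverages of the bins on one chromosome. The function name is hypothetical.

```python
from statistics import median

# Scale factor that makes the median absolute deviation consistent with the
# standard deviation for normally distributed data.
MAD_TO_SD = 1.4826


def filter_outlier_bins(coverage, n_deviations=3.0):
    """Keep bins within n_deviations scaled MADs of the chromosome median."""
    centre = median(coverage)
    mad = median(abs(value - centre) for value in coverage)
    cutoff = n_deviations * MAD_TO_SD * mad
    return [value for value in coverage if abs(value - centre) <= cutoff]
```

Using the median and the median absolute deviation, rather than the mean and standard deviation, keeps the cutoff itself from being inflated by the very outliers it is meant to remove.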
[00329] The sequence of operations depicted in Figure 3A can be used for all bins of all chromosomes in the genome. In some embodiments, a different process is applied to the Y chromosome. To calculate the chromosome or segment dose, NCV, and/or NSV, the corrected normalized coverage amounts (as determined in Figure 3A) of the bins on the chromosomes or segments used in the expressions for dose, NCV, and/or NSV are used. See block 325. In some embodiments, an average normalized coverage amount calculated from all bins on a chromosome of interest, normalizing chromosome, segment of interest, and/or normalizing segment is used to calculate the sequence dose, NCV, and/or NSV as described elsewhere herein.
[00330] In some embodiments, the Y chromosome is treated differently. It can be filtered by masking a set of bins unique to the Y chromosome. In some embodiments, the Y chromosome filter is determined according to the process in US Provisional Patent Application No. 61/836,057, previously incorporated by reference. In some embodiments, the filter masks bins that are smaller than those in the filters of the other chromosomes. For example, the Y chromosome mask can filter at the 1 kb level, while the masks of the other chromosomes can filter at the 100 kb level. However, the Y chromosome can be normalized at the same bin size as the other chromosomes (for example, 100 kb).
[00331] In some embodiments, the filtered Y chromosome is normalized as described above in operation 315 of Figure 3A. Otherwise, however, the Y chromosome is not further corrected. Thus, the Y chromosome bins are not subjected to global profile removal. Similarly, the Y chromosome bins are not subjected to GC correction or the other filtering steps performed thereafter. This is because, when the sample is processed, the process does not know whether the sample is male or female. A female sample should have no readings that align to the reference Y chromosome.
Creating a Sequence Mask
[00332] Some embodiments described herein use a strategy for filtering out (or masking) non-discriminating sequence readings in a sequence of interest using sequence masks, which leads to higher signal and lower noise, relative to values calculated by conventional methods, in the coverage values used for the assessment of CNV. Such masks can be identified by various techniques. In one embodiment, a mask is identified using a technique illustrated in Figures 4A-4B, as explained in more detail below.
[00333] In some implementations, the mask is identified using a training set of representative samples known to have a normal copy number of the sequence of interest. Masks can be identified using a technique that first normalizes the training set samples, then corrects for systematic variation across a range of sequence (for example, a profile), and then corrects for GC variability as described below. The normalization and correction are performed on samples from a training set, not on test samples. The mask is identified once and then applied to many test samples.
[00334] Figure 4A shows a flowchart of a process 400 for creating such a sequence mask, which can be applied to one or more test samples to remove bins in a sequence of interest from consideration in the evaluation of copy number. The process 400 illustrated in Figure 4A uses sequence label coverage, based on the number of sequence labels, to obtain a sequence mask. However, similar to the description above of process 100 for determining CNV with reference to Figure 1, other variables or parameters, such as size, size ratio and methylation level, can be used in addition to or instead of coverage in process 400. In some implementations, a mask is generated for each of two or more parameters. In addition, coverage and other parameters can be weighted based on the size of the fragments from which the labels are derived. For ease of reading, only coverage is referred to in process 400, but it should be noted that other parameters, such as size, size ratio, methylation level, size-weighted count, etc., can be used in place of coverage.
[00335] Process 400 begins by providing a training set including sequence readings from a plurality of unaffected training samples. Block 402. The process then aligns the sequence readings of the training set to a reference genome comprising the sequence of interest, thereby providing training sequence labels for the training samples. Block 404. In some embodiments, only uniquely aligned, non-redundant labels mapped to non-excluded sites are used for further analysis. The process involves dividing the reference genome into a plurality of bins and determining, for each unaffected training sample, a coverage of training sequence labels in each bin. Block 406. The process also determines for each bin an expected coverage of the training sequence labels across all training samples. Block 408. In some embodiments, the expected coverage of each bin is the median or mean across the training samples. The expected coverages constitute a global profile. The process then adjusts the coverage of the training sequence labels in each bin for each training sample by removing the variation in the global profile, thereby obtaining global-profile-corrected coverages of the training sequence labels in the bins for each training sample. The process then creates a sequence mask comprising unmasked and masked bins across the reference genome. Each masked bin has a distribution characteristic exceeding a masking threshold. The distribution characteristic is provided for the adjusted coverages of the training sequence labels in the bin across the training samples. In some implementations, the masking threshold may relate to the variation observed in normalized coverage within a bin across training samples. Bins with high coefficients of variation or median absolute deviation of normalized coverage across samples can be identified based on an empirical distribution of the respective measures.
In some alternative implementations, the masking threshold may relate to the variation observed in normalized coverage within a bin across training samples. Bins with high coefficients of variation or median absolute deviation of normalized coverage across samples can be masked based on an empirical distribution of the respective measures.
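The masking step of process 400 can be sketched, under stated assumptions, as computing a per-bin coefficient of variation across training samples and masking the bins whose CV exceeds a percentile of the empirical CV distribution. The names `bin_cv`, `percentile_cutoff` and `build_mask`, the nearest-rank percentile rule and the use of the population standard deviation are choices made for this sketch and are not taken from the specification.

```python
from statistics import mean, pstdev


def bin_cv(values):
    """Coefficient of variation of one bin's coverage across training samples."""
    m = mean(values)
    return float("inf") if m == 0 else pstdev(values) / m


def percentile_cutoff(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a list of values."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling of n * pct / 100
    return ordered[int(rank) - 1]


def build_mask(coverage, pct=95):
    """Return a list of booleans, True where a bin is masked.

    coverage[s][b] is the profile-corrected coverage of bin b in training
    sample s (hypothetical layout). A bin is masked when its CV across
    samples exceeds the pct-th percentile of all bin CVs.
    """
    n_bins = len(coverage[0])
    cvs = [bin_cv([sample[b] for sample in coverage]) for b in range(n_bins)]
    cutoff = percentile_cutoff(cvs, pct)
    return [cv > cutoff for cv in cvs]
```

The median absolute deviation could be substituted for the CV in `bin_cv` without changing the rest of the sketch. Because the mask is derived only from training samples, it can be computed once and reused for every test sample.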
[00336] In some implementations, separate cutoffs for identifying masked bins, that is, masking thresholds, are defined for the chromosome of interest and for all other chromosomes. In addition, separate masking thresholds can be defined for each chromosome of interest separately, and a single masking threshold for the set of all unaffected chromosomes. As an example, a mask based on a certain masking threshold is defined for chromosome 13 and another masking threshold is used to define a mask for the other chromosomes. Unaffected chromosomes may also have their masking thresholds defined per chromosome.
[00337] Various masking threshold combinations can be evaluated for each chromosome of interest. Each masking threshold combination provides one mask for the bins of the chromosome of interest and a different mask for the bins of all other chromosomes.
[00338] In one method, a range of cutoff values for the coefficient of variation (CV), or for another measure of sample distribution, is defined as percentiles (for example, 95, 96, 97, 98, 99) of the empirical distribution of bin CV values, and these cutoff values are applied to all autosomes excluding the chromosomes of interest. In addition, a range of percentile cutoff values for CV is defined for the empirical CV distribution, and these cutoff values are applied to a chromosome of interest (for example, chromosome 21). In some embodiments, the chromosomes of interest are the X chromosome and chromosomes 13, 18 and 21. Of course, other methods can be considered; for example, a separate optimization can be performed for each chromosome. Together, the ranges to be optimized in parallel (for example, one range for the chromosome of interest under consideration and another range for all other chromosomes) define a grid of CV cutoff combinations. See Figure 4B. The performance of the system on the training set is evaluated across the two cutoffs (one for the normalizing chromosomes, or autosomes other than the chromosome of interest, and one for the chromosome of interest), and the best-performing combination is chosen for the final configuration. This combination may be different for each of the chromosomes of interest. In some embodiments, performance is evaluated on a validation set rather than the training set, that is, cross-validation is used to evaluate performance.
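A grid search over the two cutoffs can be sketched as follows. This is a minimal illustration: it assumes that the per-bin CVs and the cutoff values (already converted from percentiles to CV values) are given, that the chromosome dose is the ratio of summed unmasked coverage on the chromosome of interest to summed unmasked coverage on the normalizing chromosomes, and that the performance measure is the CV of that dose across training samples. The names `dose_cv` and `choose_cutoffs` and the data layout are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation (population standard deviation / mean)."""
    m = mean(values)
    return float("inf") if m == 0 else pstdev(values) / m


def dose_cv(samples, keep_interest, keep_other):
    """CV across samples of dose = kept coverage on the chromosome of
    interest / kept coverage on the normalizing chromosomes.

    samples is a list of (interest_bins, other_bins) coverage lists; the
    kept normalizing coverage is assumed to be nonzero for every sample.
    """
    doses = []
    for interest_bins, other_bins in samples:
        num = sum(c for c, keep in zip(interest_bins, keep_interest) if keep)
        den = sum(c for c, keep in zip(other_bins, keep_other) if keep)
        doses.append(num / den)
    return cv(doses)


def choose_cutoffs(samples, cv_interest, cv_other, grid_interest, grid_other):
    """Evaluate every cutoff pair in the grid and return the tuple
    (best dose CV, cutoff for chromosome of interest, cutoff for others)."""
    best = None
    for cut_i, cut_o in product(grid_interest, grid_other):
        keep_i = [v <= cut_i for v in cv_interest]
        keep_o = [v <= cut_o for v in cv_other]
        if not any(keep_i) or not any(keep_o):
            continue  # a mask that removes every bin is not usable
        score = dose_cv(samples, keep_i, keep_o)
        if best is None or score < best[0]:
            best = (score, cut_i, cut_o)
    return best
```

To evaluate on held-out data instead, the masks would be derived from one subset of the training samples and `dose_cv` computed on the remaining samples.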
[00339] In some embodiments, the performance optimized to determine the cutoff ranges is the coefficient of variation of chromosome doses (based on a tentative selection of normalizing chromosomes). The process selects the combination of cutoffs that minimizes the CV of the chromosome dose (for example, a ratio) of the chromosome of interest using a currently selected normalizing chromosome (or chromosomes). In one method, the process tests the performance of each combination of cutoffs in the grid as follows: (1) apply the combination of cutoffs to define the masks for all chromosomes and apply these masks to filter the labels of a training set; (2) calculate the normalized coverages across the training set of unaffected samples by applying the process of Figure 3A to the filtered labels; (3) determine a representative normalized coverage per chromosome, for example, by summing the normalized bin coverages for the chromosome under consideration; (4) calculate the chromosome doses using the standard normalizing chromosomes; and (5) determine the CVs of the chromosome doses. The process can evaluate the performance of the selected filters by applying them to a set of test samples separated from an original portion of the training set. That is, the process divides the original training set into training and testing subsets. The training subset is used to define the mask cutoffs as described above.
[00340] In alternative embodiments, instead of defining the masks based on coverage CVs, the masks can be defined by a distribution of mapping quality scores from the alignment results across training samples within the bins. A mapping quality score reflects the uniqueness with which a reading is mapped to the reference genome. In other words, mapping quality scores quantify the probability that a reading is misaligned. A low mapping quality score is associated with low uniqueness (high probability of misalignment). The uniqueness accounts for one or more errors in the reading sequence (as generated by the sequencer). A detailed description of mapping quality scores is presented in Li H, Ruan J, Durbin R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-8, which is incorporated herein by reference in its entirety. In some implementations, the mapping quality score is referred to herein as a MapQ score. Figure 4B shows that the MapQ score has a strong monotonic correlation with the CV of the processed coverages. For example, bins with a CV higher than 0.4 cluster almost completely on the left side of the graph in Figure 4B, having MapQ scores lower than about 4. Therefore, masking bins with small MapQ can produce a mask very similar to one defined by masking bins with high CV.
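A mask based on mapping quality rather than coverage CV can be sketched as below. The function name `mapq_mask`, the use of the mean as the summary of each bin's MapQ distribution, the treatment of empty bins as masked and the default threshold of 4 (chosen to echo the value observed in Figure 4B) are all assumptions of this sketch rather than features recited in the specification.

```python
from statistics import mean


def mapq_mask(mapq_by_bin, min_mean_mapq=4.0):
    """Return a list of booleans, True where a bin is masked.

    mapq_by_bin[b] holds the MapQ values of the readings aligned to bin b,
    pooled across training samples (hypothetical layout). Bins with no
    aligned readings, or with a mean MapQ below the threshold, are masked.
    """
    return [(not scores) or mean(scores) < min_mean_mapq
            for scores in mapq_by_bin]
```

For example, a bin whose readings carry MapQ values of 0, 3 and 1 would be masked under the default threshold, while a bin with values of 60, 60 and 37 would be retained.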
Samples and Sample Processing
Samples
[00341] Samples that are used to determine a CNV, for example, chromosomal aneuploidies, partial aneuploidies and the like, can include samples taken from any cell, tissue or organ in which copy number variations for one or more sequences of interest are to be determined. Desirably, the samples contain nucleic acids that are present in cells and/or nucleic acids that are "cell-free" (e.g., cfDNA).
[00342] In some embodiments, it is advantageous to obtain cell-free nucleic acids, for example, cell-free DNA (cfDNA). Cell-free nucleic acids, including cell-free DNA, can be obtained by various methods known in the art from biological samples that include, but are not limited to, plasma, serum and urine (see, for example, Fan et al., Proc Natl Acad Sci 105:16266-16271 [2008]; Koide et al., Prenatal Diagnosis 25:604-607 [2005]; Chen et al., Nature Med. 2:1033-1035 [1996]; Lo et al., Lancet 350:485-487 [1997]; Botezatu et al., Clin Chem. 46:1078-1084 [2000]; and Su et al., J Mol. Diagn. 6:101-107 [2004]). To separate cell-free DNA from cells in a sample, various methods can be used, including, but not limited to, fractionation, centrifugation (for example, density gradient centrifugation), DNA-specific precipitation, high-throughput cell sorting and/or other separation methods. Commercial kits for manual and automated separation of cfDNA are available (Roche Diagnostics, Indianapolis, IN; Qiagen, Valencia, CA; Macherey-Nagel, Duren, DE). Biological samples comprising cfDNA have been used in assays to determine the presence or absence of chromosomal abnormalities, for example, trisomy 21, by sequencing assays that can detect chromosomal aneuploidies and/or various polymorphisms.
[00343] In several embodiments, the cfDNA present in the sample can be enriched specifically or non-specifically before use (for example, before preparing a sequencing library). Non-specific enrichment of sample DNA refers to whole genome amplification of the genomic DNA fragments of the sample, which can be used to increase the level of the sample DNA before preparing a cfDNA sequencing library. Non-specific enrichment can be the selective enrichment of one of the two genomes present in a sample that comprises more than one genome. For example, non-specific enrichment can be selective for the fetal genome in a maternal sample, which can be obtained by known methods to increase the relative proportion of fetal to maternal DNA in a sample. Alternatively, non-specific enrichment can be the non-selective amplification of both genomes present in the sample. For example, non-specific amplification can be of fetal and maternal DNA in a sample comprising a mixture of DNA from the fetal and maternal genomes. Methods for whole genome amplification are known in the art. Degenerate oligonucleotide-primed PCR (DOP), primer extension PCR (PEP) and multiple displacement amplification (MDA) are examples of whole genome amplification methods. In some embodiments, the sample comprising the mixture of cfDNA from different genomes is not enriched for cfDNA of the genomes present in the mixture. In other embodiments, the sample comprising the mixture of cfDNA from different genomes is non-specifically enriched for any one of the genomes present in the sample.
[00344] The sample comprising the nucleic acid(s) to which the methods described herein are applied typically comprises a biological sample ("test sample"), for example, as described above. In some embodiments, the nucleic acid(s) to be evaluated for one or more CNVs is (are) purified or isolated by any of a number of well-known methods.
Consequently, in certain embodiments, the sample comprises or consists of a purified or isolated polynucleotide, or it may comprise samples such as a tissue sample, a biological fluid sample, a cell sample and the like. Suitable biological fluid samples include, but are not limited to, plasma, serum, sweat, tears, sputum, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts, amniotic fluid and leukophoresis samples. In some embodiments, the sample is a sample that is easily obtainable by non-invasive procedures, for example, blood, plasma, serum, sweat, tears, sputum, urine, ear flow, saliva or feces. In certain embodiments, the sample is a peripheral blood sample or the plasma and/or serum fractions of a peripheral blood sample. In other embodiments, the biological sample is a swab or smear, a biopsy specimen or a cell culture. In another embodiment, the sample is a mixture of two or more biological samples, for example, a biological sample can comprise two or more of a biological fluid sample, a tissue sample and a cell culture sample. As used herein, the terms "blood", "plasma" and "serum" expressly include fractions or processed portions thereof. Similarly, when a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly includes a processed fraction or portion derived from the biopsy, swab, smear, etc.
[00346] In certain embodiments, samples may be obtained from sources that include, but are not limited to, samples from different individuals, samples from different developmental stages of the same or different individuals, samples from different diseased individuals (for example, individuals with cancer or suspected of having a genetic disorder), normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from an individual subjected to different treatments for a disease, samples from individuals subjected to different environmental factors, samples from individuals predisposed to a condition, samples from individuals exposed to an infectious disease agent (for example, HIV) and the like.
[00347] In an illustrative but non-limiting embodiment, the sample is a maternal sample that is obtained from a pregnant female, for example, a pregnant woman. In this example, the sample can be analyzed using the methods described here to provide a prenatal diagnosis of potential chromosomal abnormalities in the fetus. The maternal sample can be a tissue sample, a biological fluid sample or a cell sample. A biological fluid includes, as non-limiting examples, blood, plasma, serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts and leukophoresis samples.
[00348] In another illustrative but non-limiting embodiment, the maternal sample is a mixture of two or more biological samples, for example, the biological sample can comprise two or more of a biological fluid sample, a tissue sample and a cell culture sample. In some embodiments, the sample is a sample that is easily obtainable by non-invasive procedures, for example, blood, plasma, serum, sweat, tears, sputum, urine, milk, ear flow, saliva and feces. In some embodiments, the biological sample is a peripheral blood sample and/or the plasma and serum fractions thereof. In other embodiments, the biological sample is a swab or smear, a biopsy specimen or a sample of a cell culture. As described above, the terms "blood", "plasma" and "serum" expressly cover fractions or processed portions thereof. Similarly, when a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly includes a processed fraction or portion derived from the biopsy, swab, smear, etc.
[00349] In certain embodiments, samples can also be obtained from tissues grown in vitro, cells or other sources containing polynucleotides. Cultured samples can be taken from sources that include, but are not limited to, cultures (for example, tissues or cells) maintained in different media and conditions (for example, pH, pressure or temperature), cultures (for example, tissues or cells) maintained for different periods of time, cultures (for example, tissue or cells) treated with different factors or reagents (for example, a candidate drug or a modulator) or cultures of different types of tissues and / or cells.
[00350] Methods of isolating nucleic acids from biological sources are well known and will differ depending on the nature of the source. A person skilled in the art can readily isolate nucleic acid(s) from a source as needed for the method described here. In some instances, it may be advantageous to fragment the nucleic acid molecules in the nucleic acid sample. Fragmentation can be random, or it can be specific, as achieved, for example, using restriction endonuclease digestion. Methods for random fragmentation are well known in the art and include, for example, limited DNase digestion, alkali treatment and physical shearing. In one embodiment, the sample nucleic acids are obtained as cfDNA, which is not subjected to fragmentation.
Preparation of the Sequencing Library
[00351] In one embodiment, the methods described here can use next generation sequencing (NGS) technologies, which allow multiple samples to be sequenced individually as genomic molecules (that is, singleplex sequencing) or as pooled samples comprising indexed genomic molecules (for example, multiplex sequencing) on a single sequencing run. These methods can generate up to several hundred million readings of DNA sequences. In various embodiments, the sequences of genomic nucleic acids and/or of indexed genomic nucleic acids can be determined using, for example, the NGS technologies described here. In several embodiments, the analysis of the massive amount of sequence data obtained using NGS can be performed using one or more processors as described here.
[00352] In several embodiments, the use of such sequencing technologies does not involve the preparation of sequencing libraries.
[00353] However, in certain embodiments the sequencing methods considered here involve the preparation of sequencing libraries. In an illustrative method, the preparation of the sequencing library involves producing a random collection of adapter-modified DNA fragments (for example, polynucleotides) that are ready to be sequenced. Sequencing libraries of polynucleotides can be prepared from DNA or RNA, including equivalents and analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of reverse transcriptase. The polynucleotides can originate in double-stranded form (for example, dsDNA such as genomic DNA fragments, cDNA, PCR amplification products and the like) or, in certain embodiments, the polynucleotides can originate in single-stranded form (for example, ssDNA, RNA, etc.) and have been converted to dsDNA form. By way of illustration, in certain embodiments, single-stranded mRNA molecules can be copied into double-stranded cDNAs suitable for use in preparing a sequencing library. The precise sequence of the primary polynucleotide molecules is generally not material to the method of library preparation and may be known or unknown. In one embodiment, the polynucleotide molecules are DNA molecules. More particularly, in certain embodiments, the polynucleotide molecules represent the entire genetic complement of an organism or substantially the entire genetic complement of an organism, and are genomic DNA molecules (for example, cellular DNA, cell-free DNA (cfDNA), etc.) that typically include both intron sequence and exon sequence (coding sequence), as well as non-coding regulatory sequences such as promoter and enhancer sequences. In certain embodiments, the primary polynucleotide molecules comprise human genomic DNA molecules, for example, cfDNA molecules present in the peripheral blood of a pregnant individual.
[00354] The preparation of sequencing libraries for some NGS sequencing platforms is facilitated by the use of polynucleotides that comprise a specific range of fragment sizes. The preparation of such libraries typically involves the fragmentation of large polynucleotides (for example, cellular genomic DNA) to obtain polynucleotides in the desired size range.
[00355] Fragmentation can be achieved by any of several methods known to those skilled in the art. For example, fragmentation can be achieved by mechanical means that include, but are not limited to, nebulization, sonication and hydroshearing. However, mechanical fragmentation typically cleaves the DNA backbone at C-O, P-O and C-C bonds, resulting in a heterogeneous mixture of blunt ends and 3' and 5' overhangs with broken C-O, P-O and C-C bonds (see, for example, Alnemri and Litwack, J Biol. Chem 265:17323-17333 [1990]; Richards and Boyer, J Mol Biol 11:327-240 [1965]), which may need to be repaired because they may lack the 5'-phosphate required for subsequent enzymatic reactions, for example, the ligation of sequencing adapters, which are required to prepare DNA for sequencing.
[00356] In contrast, cfDNA typically exists as fragments of less than about 300 base pairs and, consequently, fragmentation is typically not necessary to generate a sequencing library using cfDNA samples.
[00357] Typically, whether the polynucleotides are forcibly fragmented (e.g., fragmented in vitro) or exist naturally as fragments, they are converted to blunt-ended DNA having 5'-phosphates and 3'-hydroxyls. Standard protocols, for example, protocols for sequencing using the Illumina platform as described elsewhere herein, instruct users to end-repair sample DNA, to purify the end-repaired products before dA-tailing, and to purify the dA-tailed products before the adapter-ligation steps of the library preparation.
[00358] Various embodiments of the sequencing library preparation methods described here obviate the need to perform one or more of the steps typically mandated by standard protocols to obtain a modified DNA product that can be sequenced by NGS. An abbreviated method (ABB method), a 1-step method and a 2-step method are examples of methods for preparing a sequencing library, which can be found in patent application 13/555,037, filed on July 20, 2012, which is incorporated by reference in its entirety.
Marker Nucleic Acids for Tracking and Verifying Sample Integrity
[00359] In several embodiments, verification of sample integrity and sample tracking can be performed by sequencing mixtures of sample genomic nucleic acids, for example, cfDNA, and tracking marker nucleic acids that were introduced into the samples, for example, before processing.
[00360] The marker nucleic acids can be combined with the test sample (for example, a biological source sample) and subjected to processes that include, for example, one or more of the steps of fractionating the biological source sample, for example, obtaining an essentially cell-free plasma fraction from a whole blood sample; purifying nucleic acids from a fractionated (for example, plasma) or unfractionated (for example, tissue) biological source sample; and sequencing. In some embodiments, sequencing comprises preparing a sequencing library. The sequence or combination of sequences of the marker molecules that are combined with a source sample is chosen to be unique to the source sample. In some embodiments, all of the unique marker molecules in a sample have the same sequence. In other embodiments, the unique marker molecules in a sample are a plurality of sequences, for example, a combination of two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty or more different sequences.
[00361] In one embodiment, the integrity of the sample can be verified using a plurality of marker nucleic acid molecules having identical sequences. Alternatively, the identity of a sample can be verified using a plurality of marker nucleic acid molecules having at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50 or more different sequences. Verifying the integrity of a plurality of biological samples, that is, two or more biological samples, requires that each of the two or more samples be marked with marker nucleic acids having sequences that are unique to each of the plurality of test samples being marked. For example, a first sample can be marked with a marker nucleic acid having sequence A and a second sample can be marked with a marker nucleic acid having sequence B. Alternatively, a first sample can be marked with marker nucleic acid molecules all having sequence A and a second sample can be marked with a mixture of sequences B and C, where sequences A, B and C are marker molecules having different sequences.
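The bookkeeping behind such a check can be illustrated with a short sketch that compares the marker sequences recovered from a sample's readings with the set of markers assigned to that sample. The names `markers_found` and `verify_sample`, the dictionary-based marker catalog and the use of exact substring matching (with no tolerance for sequencing errors) are assumptions made for illustration only.

```python
def markers_found(reads, marker_catalog):
    """Names of catalog markers that occur as an exact substring of any read."""
    return {name for name, seq in marker_catalog.items()
            if any(seq in read for read in reads)}


def verify_sample(reads, expected_markers, marker_catalog):
    """Compare the markers detected in a sample with the markers assigned to it.

    Returns (ok, missing, unexpected): ok is True only when the detected
    set equals the expected set.
    """
    found = markers_found(reads, marker_catalog)
    expected = set(expected_markers)
    missing = expected - found        # assigned markers that were not seen
    unexpected = found - expected     # markers belonging to another sample
    return (not missing and not unexpected, missing, unexpected)
```

In the example of the text, a sample assigned a mixture of markers B and C would pass only if both B and C are recovered and A is not. An unexpected marker suggests a sample mix-up or cross-contamination.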
[00362] The marker nucleic acids can be added to the sample at any stage of sample preparation that occurs before library preparation (if libraries are to be prepared) and sequencing. In one embodiment, the marker molecules can be combined with an unprocessed source sample. For example, the marker nucleic acid can be supplied in a collection tube that is used to collect a blood sample. Alternatively, the marker nucleic acids can be added to the blood sample following the blood draw. In one embodiment, the marker nucleic acid is added to the container that is used to collect a biological fluid sample, for example, the marker nucleic acids are added to a blood collection tube that is used to collect a blood sample. In another embodiment, the marker nucleic acids are added to a fraction of the biological fluid sample. For example, the marker nucleic acid is added to the plasma and/or serum fraction of a blood sample, for example, a maternal plasma sample. In yet another embodiment, the marker molecules are added to a purified sample, for example, a nucleic acid sample that has been purified from a biological sample. For example, the marker nucleic acid is added to a sample of purified maternal and fetal cfDNA. Similarly, the marker nucleic acids can be added to a biopsy specimen before processing the specimen. In some embodiments, the marker nucleic acids can be combined with a carrier that delivers the marker molecules into the cells of the biological sample. Cell delivery carriers include pH-sensitive liposomes and cationic liposomes.
[00363] In several embodiments, the marker molecules have antigenomic sequences, which are sequences that are absent from the genome of the biological source sample. In an exemplary embodiment, the marker molecules that are used to verify the integrity of a human biological source sample have sequences that are absent from the human genome. In an alternative embodiment, the marker molecules have sequences that are absent from the source sample and from any one or more other known genomes. For example, the marker molecules that are used to verify the integrity of a human biological source sample have sequences that are absent from the human genome and from the mouse genome. This alternative allows verifying the integrity of a test sample that comprises two or more genomes. For example, the integrity of a human cell-free DNA sample obtained from an individual affected by a pathogen, for example, a bacterium, can be verified using marker molecules having sequences that are absent from both the human genome and the genome of the affecting bacterium. The genome sequences of numerous pathogens, for example, bacteria, viruses, yeasts, fungi, protozoa, etc., are publicly available on the Internet at ncbi.nlm.nih.gov/genomes. In another embodiment, the marker molecules are nucleic acids having sequences that are absent from any known genome. The sequences of marker molecules can be randomly generated algorithmically.
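One way to generate such a sequence algorithmically is rejection sampling: draw a random sequence and accept it only if none of its k-mers occurs in the genomes to be excluded. The sketch below is illustrative only; the names `kmers` and `generate_marker`, the k-mer criterion, the default k of 12 and the in-memory genome strings are assumptions, and a practical implementation would also need to check reverse complements and use an indexed reference rather than plain strings.

```python
import random


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def generate_marker(length, excluded_genomes, k=12, seed=None, max_tries=1000):
    """Draw random DNA sequences until one shares no k-mer with any
    excluded genome (forward strand only, a simplification)."""
    rng = random.Random(seed)
    forbidden = set()
    for genome in excluded_genomes:
        forbidden |= kmers(genome, k)
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if kmers(candidate, k).isdisjoint(forbidden):
            return candidate
    raise RuntimeError("no antigenomic candidate found within max_tries")
```

Passing both a host genome and a pathogen genome in `excluded_genomes` corresponds to the alternative in which the marker is absent from two or more genomes.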
[00364] In several embodiments, the marker molecules can be naturally occurring deoxyribonucleic acids (DNA), ribonucleic acids, or artificial nucleic acid analogs (nucleic acid mimics) including peptide nucleic acids (PNA), morpholino nucleic acids, locked nucleic acids, glycol nucleic acids and threose nucleic acids, which are distinguished from naturally occurring DNA or RNA by changes to the backbone of the molecule, or DNA mimics that do not have a phosphodiester backbone. The deoxyribonucleic acids can be from naturally occurring genomes or can be generated in a laboratory through the use of enzymes or by solid phase chemical synthesis. Chemical methods can also be used to generate DNA mimics that are not found in nature. Available DNA derivatives in which the phosphodiester linkage has been replaced but the deoxyribose is retained include, but are not limited to, DNA mimics having thioformacetal or carboxamide linkages, which have been shown to be good structural DNA mimics. Other DNA mimics include morpholino derivatives and the peptide nucleic acids (PNA), which contain an N-(2-aminoethyl)glycine-based pseudopeptide backbone (Ann Rev Biophys Biomol Struct 24:167-183 [1995]). PNA is an extremely good structural mimic of DNA (or of ribonucleic acid [RNA]), and PNA oligomers are able to form very stable complex structures with Watson-Crick complementary DNA and RNA (or PNA) oligomers, and they can also bind to targets in duplex DNA by helix invasion (Mol Biotechnol 26:233-248 [2004]). Another good structural mimic/analog of DNA that can be used as a marker molecule is phosphorothioate DNA, in which one of the non-bridging oxygens is replaced by a sulfur. This modification reduces the action of endo- and exonucleases, including 5' to 3' and 3' to 5' DNA POL 1 exonuclease, nucleases S1 and P1, RNases, serum nucleases and snake venom phosphodiesterase.
[00365] The length of the marker molecules can be distinct or indistinct from that of the sample nucleic acids, that is, the length of the marker molecules can be similar to that of the sample genomic molecules, or it can be greater or smaller than that of the sample genomic molecules. The length of the marker molecules is measured by the number of nucleotide or nucleotide analog bases that constitute the marker molecule. Marker molecules having lengths that differ from those of the sample genomic molecules can be distinguished from the source nucleic acids using separation methods known in the art. For example, differences in the length of the marker and sample nucleic acid molecules can be determined by electrophoretic separation, for example, capillary electrophoresis. Size differentiation can be advantageous for quantifying and assessing the quality of the marker and sample nucleic acids. Preferably, the marker nucleic acids are shorter than the genomic nucleic acids, and of sufficient length to exclude them from being mapped to the genome of the sample. For example, a 30 base human sequence is needed to map it uniquely to a human genome. Consequently, in certain embodiments, marker molecules used in sequencing bioassays of human samples should be at least 30 base pairs in length.
[00366] The choice of the length of the marker molecule is determined primarily by the sequencing technology that is used to verify the integrity of a source sample. The length of the sample genomic nucleic acids being sequenced can also be considered. For example, some sequencing technologies employ clonal amplification of polynucleotides, which may require that the genomic polynucleotides to be clonally amplified be of a minimum length. For example, sequencing using the Illumina GAII sequence analyzer includes in vitro clonal amplification by bridge PCR (also known as cluster amplification) of polynucleotides having a minimum length of 110 base pairs, to which adapters are ligated to provide a nucleic acid of at least 200 base pairs and less than 600 base pairs that can be clonally amplified and sequenced. In some embodiments, the length of the adapter-ligated marker molecule is between about 200 base pairs and about 600 base pairs, between about 250 base pairs and 550 base pairs, between about 300 base pairs and 500 base pairs or between about 350 and 450 base pairs. In other embodiments, the length of the adapter-ligated marker molecule is about 200 base pairs. For example, when sequencing the fetal cfDNA that is present in a maternal sample, the length of the marker molecule can be chosen to be similar to that of the fetal cfDNA molecules. Thus, in one embodiment, the length of the marker molecule used in an assay that comprises massively parallel sequencing of cfDNA in a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy can be about 150 base pairs, about 160 base pairs, about 170 base pairs, about 180 base pairs, about 190 base pairs or about 200 base pairs; preferably, the marker molecule is about 170 base pairs.
Other sequencing methods, for example, SOLiD sequencing, Polony sequencing and 454 sequencing, use emulsion PCR to clonally amplify DNA molecules for sequencing, and each technology dictates the minimum and maximum length of the molecules that are to be amplified. The length of the marker molecules to be sequenced as clonally amplified nucleic acids can be up to about 600 base pairs. In some embodiments, the length of the marker molecules to be sequenced can be greater than 600 base pairs.
[00367] Single molecule sequencing technologies, which do not use clonal amplification of molecules and are capable of sequencing nucleic acids over a very wide range of template lengths, in most situations do not require the molecules to be sequenced to be of any specific length. However, the yield of sequences per unit mass depends on the number of 3' end hydroxyl groups, so relatively short templates are sequenced more efficiently than long templates. If starting with nucleic acids longer than 1000 nt, it is generally advisable to shear the nucleic acids to an average length of 100 to 200 nt so that more sequence information can be generated from the same mass of nucleic acids. Thus, the length of the marker molecule can vary from tens of bases to thousands of bases. The length of marker molecules used for single molecule sequencing can be up to about 25 base pairs, up to about 50 base pairs, up to about 75 base pairs, up to about 100 base pairs, up to about 200 base pairs, up to about 300 base pairs, up to about 400 base pairs, up to about 500 base pairs, up to about 600 base pairs, up to about 700 base pairs, up to about 800 base pairs, up to about 900 base pairs, up to about 1000 base pairs or more in length.
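As a non-limiting illustration of why shorter templates yield more sequence per unit mass, the following sketch estimates the number of 3' hydroxyl ends in one nanogram of double-stranded DNA at two fragment lengths. The average mass of about 650 g/mol per base pair is a common approximation and is an assumption of this sketch, not a value from this description; the function name is hypothetical.

```python
AVOGADRO = 6.022e23            # molecules per mole
GRAMS_PER_MOL_PER_BP = 650.0   # assumed average mass of one base pair of dsDNA


def three_prime_ends_per_ng(length_bp: float) -> float:
    """Approximate number of 3' hydroxyl ends in 1 ng of dsDNA fragments."""
    grams = 1e-9
    moles = grams / (length_bp * GRAMS_PER_MOL_PER_BP)
    molecules = moles * AVOGADRO
    return 2.0 * molecules  # two strands per fragment, one 3' end on each


short = three_prime_ends_per_ng(150)    # template sheared to about 150 bp
long_ = three_prime_ends_per_ng(1000)   # unsheared template of about 1000 bp
ratio = short / long_                   # fold increase in available 3' ends
```

Under these assumptions the number of 3' ends scales inversely with fragment length, so shearing from 1000 to 150 base pairs gives roughly a 6.7-fold increase for the same mass.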
[00368] The length chosen for a marker molecule is also determined by the length of the genomic nucleic acid being sequenced. For example, cfDNA circulates in the human bloodstream as genomic fragments of cellular genomic DNA. The fetal cfDNA molecules found in the plasma of pregnant women are generally shorter than maternal cfDNA molecules (Chan et al., Clin Chem 50: 88-92 [2004]). Size fractionation of circulating fetal DNA confirmed that the average length of circulating fetal DNA fragments is <300 base pairs, while maternal DNA was estimated to be between about 0.5 and 1 Kb (Li et al., Clin Chem, 50: 1002-1011 [2004]). These findings are consistent with those of Fan et al., who determined using NGS that fetal cfDNA is rarely >340 base pairs (Fan et al., Clin Chem 56: 1279-1286 [2010]). DNA isolated from urine using a standard silica-based method consists of two fractions: high molecular weight DNA, which originates from shed cells, and a low molecular weight fraction (150 to 250 base pairs) of transrenal DNA (Tr-DNA) (Botezatu et al., Clin Chem. 46: 1078-1084, 2000; and Su et al., J Mol. Diagn. 6: 101-107, 2004). The application of a recently developed technique for the isolation of cell-free nucleic acids from body fluids to the isolation of transrenal nucleic acids revealed the presence in urine of DNA and RNA fragments much shorter than 150 base pairs (US Patent Application Publication No. 20080139801). In embodiments in which cfDNA is the genomic nucleic acid that is sequenced, the marker molecules that are chosen may be up to about the length of the cfDNA. For example, the length of the marker molecules used in maternal cfDNA samples to be sequenced as single nucleic acid molecules or as clonally amplified nucleic acids can be between about 100 base pairs and 600 base pairs. In other embodiments, the genomic nucleic acids of the sample are fragments of larger molecules.
For example, the sample genomic nucleic acid that is sequenced is fragmented cellular DNA. In embodiments in which fragmented cellular DNA is sequenced, the length of the marker molecules can be up to the length of the DNA fragments. In some embodiments, the length of the marker molecules is at least the minimum length required to map the sequence read uniquely to the appropriate reference genome. In other embodiments, the length of the marker molecule is the minimum length that is required to exclude the marker molecule from being mapped to the sample reference genome.
[00369] In addition, the marker molecules can be used to verify samples that are not assayed by nucleic acid sequencing and that can be verified by common biotechniques other than sequencing, for example, real-time PCR.
Sample controls (for example, in-process positive controls for sequencing and / or analysis)
[00370] In several embodiments, the marker sequences introduced into the samples, for example, as described above, can function as positive controls to verify the accuracy and efficacy of sequencing and of subsequent processing and analysis.
[00371] Accordingly, compositions and methods for providing an in-process positive control (IPC) for sequencing DNA in a sample are provided. In certain embodiments, positive controls are provided for sequencing cfDNA in a sample that comprises a mixture of genomes. An IPC can be used to relate baseline shifts in the sequence information obtained from different sets of samples, for example, samples that are sequenced at different times in different sequencing runs. Thus, for example, an IPC can relate the sequence information obtained from a maternal test sample to the sequence information obtained from a set of qualified samples that were sequenced at a different time.
[00372] Similarly, in the case of segment analysis, an IPC can relate the sequence information obtained from an individual for particular segment(s) to the sequence information obtained from a set of qualified samples (of similar sequences) that were sequenced at a different time. In certain embodiments, an IPC can relate the sequence information obtained from an individual for particular cancer-related loci to the sequence information obtained from a set of qualified samples (for example, from a known amplification / deletion and the like).
[00373] In addition, IPCs can be used as markers to track samples through the sequencing process. IPCs can also provide a qualitative positive sequence dose value, for example, NCV, for one or more aneuploidies of chromosomes of interest, for example, trisomy 21, trisomy 13 or trisomy 18, to provide proper interpretation and to ensure the dependability and accuracy of the data. In certain embodiments, IPCs can be created to comprise nucleic acids from male and female genomes to provide doses for chromosomes X and Y in a maternal sample to determine whether the fetus is male.
[00374] The type and number of in-process controls depend on the type or nature of the test required. For example, for a test that requires sequencing DNA from a sample comprising a mixture of genomes to determine whether a chromosomal aneuploidy exists, the in-process control may comprise DNA obtained from a sample known to comprise the same chromosomal aneuploidy that is being tested. In some embodiments, the IPC includes DNA from a sample known to comprise an aneuploidy of a chromosome of interest. For example, the IPC for a test to determine the presence or absence of a fetal trisomy, for example, trisomy 21, in a maternal sample comprises DNA obtained from an individual with trisomy 21. In some embodiments, the IPC comprises a mixture of DNA obtained from two or more individuals with different aneuploidies. For example, for a test to determine the presence or absence of trisomy 13, trisomy 18, trisomy 21 and monosomy X, the IPC comprises a combination of DNA samples obtained from pregnant women, each carrying a fetus with one of the aneuploidies being tested. In addition to complete chromosomal aneuploidies, IPCs can be created to provide positive controls for tests to determine the presence or absence of partial aneuploidies.
[00375] An IPC that serves as the control for detecting a single aneuploidy can be created using a mixture of cellular genomic DNA obtained from two individuals, one being the contributor of the aneuploid genome. For example, an IPC that is created as a control for a test to determine a fetal trisomy, for example, trisomy 21, can be created by combining genomic DNA from a male or female individual who carries the trisomic chromosome with genomic DNA from a female individual known not to carry the trisomic chromosome. Genomic DNA can be extracted from cells of both individuals and sheared to provide fragments of between about 100 and 400 base pairs, between about 150 and 350 base pairs or between about 200 and 300 base pairs to simulate the circulating cfDNA fragments in maternal samples. The proportion of fragmented DNA from the individual carrying the aneuploidy, for example, trisomy 21, is chosen to simulate the proportion of circulating fetal cfDNA found in maternal samples, to provide an IPC comprising a mixture of fragmented DNA comprising about 5%, about 10%, about 15%, about 20%, about 25% or about 30% of DNA from the individual who carries the aneuploidy. The IPC can comprise DNA from different individuals, each carrying a different aneuploidy. For example, the IPC can comprise about 80% of the unaffected female DNA, and the remaining 20% can be DNA from three different individuals, each carrying one of a trisomic chromosome 21, a trisomic chromosome 13 and a trisomic chromosome 18. The mixture of fragmented DNA is prepared for sequencing. Processing of the mixture of fragmented DNA may comprise preparing a sequencing library, which can be sequenced using any massively parallel method in a singleplex or multiplex manner. Stock solutions of the genomic IPC can be stored and used in multiple diagnostic tests.
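The proportions described for a multi-aneuploidy IPC can be sketched as a simple mass calculation. The following is a minimal, non-limiting example; it assumes that the affected fraction is split equally among the donors, which this description does not specify, and the function and donor names are hypothetical.

```python
def ipc_mixture(total_ng, affected_fraction, affected_donors):
    """Return the mass (ng) of fragmented DNA to take from each source.

    Assumption of this sketch: the affected fraction is divided equally
    among the affected donors.
    """
    if not 0.0 < affected_fraction < 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if not affected_donors:
        raise ValueError("at least one affected donor is required")
    per_donor = total_ng * affected_fraction / len(affected_donors)
    recipe = {"unaffected_female": total_ng * (1.0 - affected_fraction)}
    for donor in affected_donors:
        recipe[donor] = per_donor
    return recipe


# About 80% unaffected female DNA, with the remaining 20% split among three donors.
recipe = ipc_mixture(100.0, 0.20, ["trisomy_21", "trisomy_13", "trisomy_18"])
```

With these inputs the recipe calls for 80 ng of unaffected DNA and about 6.7 ng from each affected donor per 100 ng of IPC.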
[00376] Alternatively, the IPC can be created using cfDNA obtained from a mother known to carry a fetus with a known chromosomal aneuploidy. For example, cfDNA can be obtained from a pregnant woman carrying a fetus with trisomy 21. The cfDNA is extracted from the maternal sample, cloned into a bacterial vector and grown in bacteria to provide a continuous source of the IPC. The DNA can be extracted from the bacterial vector using restriction enzymes. Alternatively, the cloned cfDNA can be amplified, for example, by PCR. The IPC DNA can be processed for sequencing in the same runs as the cfDNA of the test samples that are to be analyzed for the presence or absence of chromosomal aneuploidies.
[00377] While the creation of IPCs is described above with respect to trisomies, it will be appreciated that IPCs can be created to reflect other partial aneuploidies including, for example, various segment amplifications and / or deletions. Thus, for example, where various cancers are known to be associated with particular amplifications (for example, breast cancer associated with 20q13), IPCs can be created that incorporate those known amplifications.
Sequencing Methods
[00378] As indicated above, the prepared samples (for example, sequencing libraries) are sequenced as part of the procedure for identifying copy number variation(s). Any of a number of sequencing technologies can be used.
[00379] Some sequencing technologies are commercially available, such as the sequencing-by-hybridization platform from Affymetrix Inc. (Sunnyvale, CA), the sequencing-by-synthesis platforms from 454 Life Sciences (Branford, CT), Illumina / Solexa (Hayward, CA) and Helicos Biosciences (Cambridge, MA), and the sequencing-by-ligation platform from Applied Biosystems (Foster City, CA), as described below. In addition to the single molecule sequencing performed using the sequencing-by-synthesis of Helicos Biosciences, other single molecule sequencing technologies include, but are not limited to, the SMRT® technology of Pacific Biosciences, the ION TORRENT technology and nanopore sequencing developed, for example, by Oxford Nanopore Technologies.
[00380] While the automated Sanger method is considered a 'first generation' technology, Sanger sequencing, including automated Sanger sequencing, can also be used in the methods described here. Additional suitable sequencing methods include, but are not limited to, nucleic acid imaging technologies, for example, atomic force microscopy (AFM) or transmission electron microscopy (TEM). Illustrative sequencing technologies are described in more detail below.
[00381] In one illustrative but non-limiting embodiment, the methods described here comprise obtaining sequence information for the nucleic acids in a test sample, for example, cfDNA in a maternal sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using Illumina's sequencing-by-synthesis and reversible terminator-based sequencing chemistry (for example, as described in Bentley et al., Nature 6: 53-59 [2009]). The template DNA can be genomic DNA, for example, cellular DNA or cfDNA. In some embodiments, genomic DNA from isolated cells is used as the template and is fragmented into lengths of several hundred base pairs. In other embodiments, cfDNA is used as the template and fragmentation is not required, as cfDNA exists as short fragments. For example, fetal cfDNA circulates in the bloodstream as fragments approximately 170 base pairs (bp) in length (Fan et al., Clin Chem 56: 1279-1286 [2010]) and no fragmentation of the DNA is required before sequencing. Illumina's sequencing technology relies on the attachment of fragmented genomic DNA to a planar, optically transparent surface on which oligonucleotide anchors are bound. Template DNA is end-repaired to generate 5'-phosphorylated blunt ends, and the polymerase activity of Klenow fragment is used to add a single A base to the 3' end of the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adapters, which have an overhang of a single T base at their 3' end to increase ligation efficiency. The adapter oligonucleotides are complementary to the flow cell anchor oligos (not to be confused with the anchor / anchored reads in the analysis of repeat expansion). Under limiting-dilution conditions, adapter-modified, single-stranded template DNA is added to the flow cell and immobilized by hybridization to the anchor oligos.
The attached DNA fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with hundreds of millions of clusters, each containing about 1,000 copies of the same template. In one embodiment, the randomly fragmented genomic DNA is amplified using PCR before it is subjected to cluster amplification. Alternatively, an amplification-free (for example, PCR-free) genomic library preparation is used, and the randomly fragmented genomic DNA is enriched using cluster amplification alone (Kozarewa et al., Nature Methods 6: 291-295 [2009]). The templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Short sequence reads of about tens to a few hundred base pairs are aligned against a reference genome, and unique mappings of the short sequence reads to the reference genome are identified using specially developed data analysis pipeline software. After completion of the first read, the templates can be regenerated in situ to enable a second read from the opposite end of the fragments. Thus, either single-end or paired-end sequencing of the DNA fragments can be used.
[00382] Various embodiments of the disclosure may use sequencing by synthesis that allows paired-end sequencing. In some embodiments, the sequencing-by-synthesis platform by Illumina involves clustering fragments. Clustering is a process in which each fragment molecule is isothermally amplified. In some embodiments, such as the examples described here, the fragment has two different adapters attached to its two ends, the adapters allowing the fragment to hybridize with the two different oligos on the surface of a flow cell lane. The fragment further includes or is connected to two index sequences at its two ends, which provide labels to identify different samples in multiplex sequencing. On some sequencing platforms, a fragment to be sequenced is also referred to as an insert.
[00383] In some implementations, a flow cell for clustering on the Illumina platform is a glass slide with lanes. Each lane is a glass channel coated with a lawn of two types of oligos. Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to a first adapter on one end of the fragment. A polymerase creates a complementary strand of the hybridized fragment. The double-stranded molecule is denatured, and the original template strand is washed away. The remaining strand, in parallel with many other remaining strands, is clonally amplified through bridge amplification.
[00384] In bridge amplification, the strand folds over, and a second adapter region at a second end of the strand hybridizes to the second type of oligo on the flow cell surface. A polymerase generates a complementary strand, forming a double-stranded bridge molecule. This double-stranded molecule is denatured, resulting in two single-stranded molecules tethered to the flow cell through two different oligos. The process is then repeated over and over and occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The 3' ends are blocked to prevent unwanted priming.
[00385] After clustering, sequencing starts with the extension of a first sequencing primer to generate the first read. With each cycle, fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. After the addition of each nucleotide, the cluster is excited by a light source and a characteristic fluorescent signal is emitted. The number of cycles determines the length of the read. The emission wavelength and the signal intensity determine the base call. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel manner. At the completion of the first read, the read product is washed away.
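The relationship between cycles, signal and base calls can be illustrated with a deliberately simplified sketch. It assumes one intensity value per base channel per cycle and calls the brightest channel; production base callers also correct for effects such as channel cross-talk and phasing, which are omitted here. The function name and the intensity values are hypothetical.

```python
BASES = "ACGT"  # assumed channel order for this sketch


def call_bases(cycles):
    """Call one base per cycle from four channel intensities (A, C, G, T)."""
    read = []
    for intensities in cycles:
        if len(intensities) != 4:
            raise ValueError("each cycle needs four channel intensities")
        # The brightest channel in this cycle is taken as the incorporated base.
        brightest = max(range(4), key=lambda i: intensities[i])
        read.append(BASES[brightest])
    return "".join(read)


# Three cycles give a read of length three.
read = call_bases([
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.2, 0.8, 0.1),
    (0.0, 0.1, 0.1, 0.7),
])
```

The read length equals the number of cycles supplied, mirroring the statement that the number of cycles determines the length of the read.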
[00386] In the next step of protocols involving two index primers, an index 1 primer is introduced and hybridized to an index 1 region on the template. Index regions provide identification of fragments, which is useful for demultiplexing samples in a multiplex sequencing process. The index 1 read is generated similarly to the first read. After completion of the index 1 read, the read product is washed away and the 3' end of the strand is deprotected. The template strand then folds over and binds to a second oligo on the flow cell. An index 2 sequence is read in the same manner as index 1. The index 2 read product is then washed away at the completion of the step.
[00387] After reading the two indices, read 2 is initiated by using polymerases to extend the second flow cell oligos, forming a double-stranded bridge. This double-stranded DNA is denatured and the 3' end is blocked. The original forward strand is cleaved off and washed away, leaving the reverse strand. Read 2 begins with the introduction of a read 2 sequencing primer. As with read 1, the sequencing steps are repeated until the desired length is achieved. The read 2 product is washed away. This entire process generates millions of reads, representing all the fragments. Sequences from pooled sample libraries are separated based on the unique indices introduced during sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired, creating contiguous sequences. These contiguous sequences are aligned to the reference genome for variant identification.
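The separation of pooled libraries by their index reads can be sketched as follows. This is a minimal example that assumes exact matching of both index sequences against a sample sheet; tolerance for index sequencing errors is omitted, and the function name, index sequences and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Group read sequences by sample using the (index 1, index 2) pair.

    reads: iterable of (index1, index2, sequence) tuples.
    sample_sheet: dict mapping (index1, index2) to a sample name.
    Reads whose index pair is not in the sheet go to "undetermined".
    """
    by_sample = defaultdict(list)
    for index1, index2, sequence in reads:
        sample = sample_sheet.get((index1, index2), "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


sheet = {("ATCACG", "TTAGGC"): "sample_A", ("CGATGT", "TGACCA"): "sample_B"}
pooled = [
    ("ATCACG", "TTAGGC", "ACGTACGT"),
    ("CGATGT", "TGACCA", "GGGTTTAA"),
    ("ATCACG", "TTAGGC", "TTTTCCCC"),
    ("NNNNNN", "TTAGGC", "CATCATCA"),
]
groups = demultiplex(pooled, sheet)
```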
[00388] The sequencing-by-synthesis example described above involves paired-end reads, which are used in many of the embodiments of the described methods. Paired-end sequencing involves 2 reads from the two ends of a fragment. When a pair of reads is mapped to a reference sequence, the base-pair distance between the two reads can be determined, and this distance can then be used to determine the length of the fragment from which the reads were obtained. In some instances, a fragment straddling two bins has one of its paired-end reads aligned to one bin and the other to an adjacent bin. This becomes rarer as the bins get longer or the reads get shorter. Various methods can be used to account for the bin membership of these fragments. For example, they can be omitted when determining the fragment size frequency of a bin; they can be counted for both adjacent bins; they can be assigned to the bin that encompasses the larger number of base pairs of the two bins; or they can be assigned to both bins with a weight related to the portion of base pairs in each bin.
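The four ways of handling a fragment that straddles two bins can be sketched as below. The example is non-limiting and assumes 0-based, half-open fragment coordinates derived from the outermost mapped positions of the two reads; the 100,000 base pair bin size, the strategy names and the function names are assumptions of this sketch.

```python
def bin_weights(start, end, bin_size):
    """Fraction of a fragment's base pairs falling in each bin.

    The fragment covers reference positions [start, end), so its length
    is end - start.
    """
    if end <= start:
        raise ValueError("fragment end must be greater than its start")
    length = end - start
    weights = {}
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        lo = max(start, b * bin_size)
        hi = min(end, (b + 1) * bin_size)
        weights[b] = (hi - lo) / length
    return weights


def assign_fragment(start, end, bin_size, strategy):
    """Return {bin index: count contribution} for one fragment."""
    weights = bin_weights(start, end, bin_size)
    if len(weights) == 1:
        return {b: 1.0 for b in weights}      # fragment lies within one bin
    if strategy == "omit":
        return {}                             # drop straddling fragments
    if strategy == "both":
        return {b: 1.0 for b in weights}      # count once in each bin
    if strategy == "majority":
        best = max(weights, key=weights.get)  # bin holding the most base pairs
        return {best: 1.0}
    if strategy == "weighted":
        return weights                        # split in proportion to overlap
    raise ValueError("unknown strategy: " + strategy)


# A 170 bp fragment with 50 bp in bin 0 and 120 bp in bin 1.
start, end, bin_size = 99_950, 100_120, 100_000
fragment_length = end - start
weighted = assign_fragment(start, end, bin_size, "weighted")
majority = assign_fragment(start, end, bin_size, "majority")
```

The choice among these strategies affects only fragments near bin boundaries, which, as noted above, become rarer as bins get longer.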
[00389] Paired-end reads can use inserts of different lengths (that is, different sizes of the fragment to be sequenced). As the default meaning in this description, paired-end reads refers to reads obtained from inserts of various lengths. In some instances, to distinguish short-insert paired-end reads from long-insert paired-end reads, the latter are also referred to as mate pair reads. In some embodiments that involve mate pair reads, two biotin junction adapters are first attached to the two ends of a relatively long insert (for example, several kb). The biotin junction adapters then link the two ends of the insert to form a circularized molecule. A subfragment that encompasses the biotin junction adapters can be obtained by further fragmenting the circularized molecule. The subfragment, which includes the two ends of the original fragment in the opposite sequence order, can then be sequenced by the same procedure as for the short-insert paired-end sequencing described above. Additional details of mate pair sequencing using an Illumina platform are shown in the online publication at the following URL, which is incorporated by reference in its entirety: res|.|illumina|.|com/documents/products/technotes/technote_nextera_matepair_data_processing. Additional information on paired-end sequencing can be found in US Patent No. 7601499 and US Patent Publication No. 2012/0053063, which are incorporated by reference with respect to materials on paired-end sequencing methods and apparatuses.
[00390] After sequencing of the DNA fragments, the sequence reads of predetermined length, for example, 100 base pairs, are mapped or aligned to a known reference genome. The mapped or aligned reads and their corresponding locations on the reference sequence are also referred to as tags. In one embodiment, the reference genome sequence is the NCBI36 / hg18 sequence, which is available on the internet at genome dot ucsc dot edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105.
Alternatively, the reference genome sequence is GRCh37 / hg19, which is available on the internet at genome dot ucsc dot edu/cgi-bin/hgGateway. Other sources of publicly available sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory) and DDBJ (the DNA Databank of Japan). A number of computer algorithms are available for aligning sequences, including without limitation BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10: R25.1-R25.10 [2009]) or ELAND (Illumina, Inc., San Diego, CA, USA). In one embodiment, one end of the clonally expanded copies of the plasma cfDNA molecules is sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software.
[00391] In one illustrative but non-limiting embodiment, the methods described here comprise obtaining sequence information for the nucleic acids in a test sample, for example, cfDNA in a maternal sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using the single molecule sequencing technology of the Helicos True Single Molecule Sequencing (tSMS) technology (for example, as described in Harris TD et al., Science 320: 106-109 [2008]). In the tSMS technique, a DNA sample is cleaved into strands of approximately 100 to 200 nucleotides, and a polyA sequence is added to the 3' end of each DNA strand. Each strand is labeled by the addition of a fluorescently labeled adenosine nucleotide. The DNA strands are then hybridized to a flow cell, which contains millions of oligo-T capture sites that are immobilized on the flow cell surface. In certain embodiments, the templates can be at a density of about 100 million templates / cm2. The flow cell is then loaded into an instrument, for example, a HeliScope® sequencer, and a laser illuminates the surface of the flow cell, revealing the position of each template. A CCD camera can map the position of the templates on the flow cell surface. The template fluorescent label is then cleaved and washed away. The sequencing reaction begins by introducing a DNA polymerase and a fluorescently labeled nucleotide. The oligo-T nucleic acid serves as a primer. The polymerase incorporates the labeled nucleotides into the primer in a template-directed manner. The polymerase and unincorporated nucleotides are removed. The templates that have directed incorporation of the fluorescently labeled nucleotide are discerned by imaging the flow cell surface. After imaging, a cleavage step removes the fluorescent label, and the process is repeated with other fluorescently labeled nucleotides until the desired read length is achieved. Sequence information is collected with each nucleotide addition step.
Whole genome sequencing by single molecule sequencing technologies excludes or typically obviates PCR-based amplification in the preparation of the sequencing libraries, and the methods allow for direct measurement of the sample, rather than measurement of copies of that sample.
[00392] In another illustrative but non-limiting embodiment, the methods described here comprise obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using 454 sequencing (Roche) (for example, as described in Margulies, M. et al. Nature 437: 376-380 [2005]). 454 sequencing typically involves two steps. In the first step, the DNA is sheared into fragments of approximately 300 to 800 base pairs, and the fragments are blunt-ended. Oligonucleotide adapters are then ligated to the ends of the fragments. The adapters serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, for example, streptavidin-coated beads using, for example, Adapter B, which contains a 5'-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (for example, picoliter-sized wells). Pyrosequencing is performed on each DNA fragment in parallel. The addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi), which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5' phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is measured and analyzed.
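Because the light signal is proportional to the number of nucleotides incorporated in a flow, a sequence can in principle be recovered by rounding each signal to a whole number of bases. The following sketch illustrates that idea only; it assumes signals already normalized so that one incorporation is about 1.0 and a cyclic flow order of T, A, C, G, both of which are assumptions of this example rather than details of this description.

```python
def decode_flowgram(flow_order, signals):
    """Convert normalized per-flow light signals into a base sequence.

    flow_order: nucleotides dispensed in cyclic order, e.g. "TACG".
    signals: one non-negative value per flow, about 1.0 per incorporated base.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations.
        count = int(max(signal, 0.0) + 0.5)
        bases.append(nucleotide * count)
    return "".join(bases)


# Flows: T x1, A x0, C x2, G x1, T x0, A x3
sequence = decode_flowgram("TACG", [1.02, 0.05, 2.10, 0.90, 0.00, 2.95])
```

Rounding becomes less reliable for long homopolymer runs, where the difference between adjacent counts is small relative to signal noise.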
[00393] In another illustrative but non-limiting embodiment, the methods described here comprise obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using SOLiD® technology (Applied Biosystems). In SOLiD® sequencing-by-ligation, the genomic DNA is sheared into fragments, and adapters are attached to the 5' and 3' ends of the fragments to generate a fragment library. Alternatively, internal adapters can be introduced by ligating adapters to the 5' and 3' ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adapter and attaching adapters to the 5' and 3' ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template and PCR components. Following PCR, the templates are denatured and the beads are enriched to separate the beads with extended templates. The templates on the selected beads are subjected to a 3' modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is cleaved and removed, and the process is then repeated.
[00394] In another illustrative but non-limiting embodiment, the methods described here comprise obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using the single molecule, real-time (SMRT™) sequencing technology of Pacific Biosciences. In SMRT sequencing, the continuous incorporation of dye-labeled nucleotides is imaged during DNA synthesis.
[00395] Single DNA polymerase molecules are attached to the bottom surface of individual zero-mode waveguide detectors (ZMW detectors) that obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand. A ZMW detector comprises a confinement structure that enables observation of the incorporation of a single nucleotide by DNA polymerase against a background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (for example, in microseconds). It typically takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Measurement of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated to provide a sequence.
[00396] In another illustrative but non-limiting embodiment, the methods described herein comprise obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using nanopore sequencing (for example, as described in Soni GV and Meller A. Clin Chem 53: 1996-2001 [2007]). Nanopore sequencing DNA analysis techniques are being developed by several companies, including, for example, Oxford Nanopore Technologies (Oxford, United Kingdom), Sequenom, NABsys and the like. Nanopore sequencing is a single-molecule sequencing technology whereby a single molecule of DNA is sequenced directly as it passes through a nanopore. A nanopore is a small hole, typically on the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential (voltage) across it results in a slight electrical current due to the conduction of ions through the nanopore. The amount of current that flows is sensitive to the size and shape of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree, changing the magnitude of the current through the nanopore to a different degree. Thus, this change in the current as the DNA molecule passes through the nanopore provides a read of the DNA sequence.
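As a non-limiting illustration of the principle that each nucleotide changes the current to a different degree, the following minimal sketch assigns each measured current level to the base whose reference level is nearest. The reference levels, the function name `call_bases` and the one-level-per-base model are assumptions made purely for illustration; they are not taken from this disclosure or from any commercial nanopore platform.

```python
# Hypothetical mean blockade currents (picoamperes) for each base.
# These values are invented for illustration only.
REFERENCE_LEVELS_PA = {"A": 50.0, "C": 40.0, "G": 30.0, "T": 20.0}


def call_bases(current_trace_pa):
    """Assign each measured current level to the base with the nearest reference level."""
    calls = []
    for level in current_trace_pa:
        nearest = min(
            REFERENCE_LEVELS_PA,
            key=lambda base: abs(REFERENCE_LEVELS_PA[base] - level),
        )
        calls.append(nearest)
    return "".join(calls)


# One current measurement per nucleotide passing through the pore (toy data).
trace = [49.2, 21.1, 38.7, 30.9, 51.3]
print(call_bases(trace))
```

In this toy model the sequence is read directly from the order of the current levels, which is the sense in which the change in current provides a read of the DNA sequence.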
[00397] In another illustrative but non-limiting embodiment, the methods described herein comprise obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being evaluated for cancer and the like, using a chemical-sensitive field effect transistor (chemFET) array (for example, as described in US Patent Application Publication No. 2009/0026082). In one example of this technique, DNA molecules can be placed into reaction chambers and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3' end of the sequencing primer can be discerned as a change in current by a chemFET. An array can have multiple chemFET sensors.
In another example, single nucleic acids can be attached to beads, the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.
[00398] In another embodiment, the present method comprises obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, using transmission electron microscopy (TEM). The method, termed Individual Molecule Placement Rapid Nano Transfer (IMPRNT), comprises using single-atom-resolution transmission electron microscope imaging of high-molecular-weight (150 kb or greater) DNA selectively labeled with heavy atom markers and arranging these molecules on ultra-thin films in ultra-dense (3 nm strand-to-strand) parallel arrays with consistent base-to-base spacing. The electron microscope is used to image the molecules on the films to determine the position of the heavy atom markers and to extract base sequence information from the DNA. The method is further described in PCT patent publication WO 2009/046445. The method allows complete human genomes to be sequenced in less than ten minutes.
[00399] In another embodiment, the DNA sequencing technology is Ion Torrent single molecule sequencing, which pairs semiconductor technology with a simple sequencing chemistry to directly translate chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip. In nature, when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a by-product. Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel manner. Each well holds a different DNA molecule. Beneath the wells is an ion-sensitive layer and beneath that an ion sensor. When a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion is released. The charge from that ion changes the pH of the solution, which can be detected by the Ion Torrent ion sensor. The sequencer (essentially the smallest solid-state pH meter in the world) calls the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change is recorded and no base is called. If there are two identical bases on the DNA strand, the voltage is doubled and the chip records two identical bases called. Direct detection allows nucleotide incorporation to be recorded in seconds.
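By way of non-limiting illustration, the flow-by-flow base calling described above (no voltage change means no base, a doubled voltage means two identical bases) can be sketched as follows. The function name `decode_flows`, the normalization of the signal so that one incorporation is approximately 1.0, and the rounding rule are assumptions made for this sketch and do not describe the actual Ion Torrent signal processing.

```python
def decode_flows(flow_order, signals):
    """Convert per-flow signals into a base sequence.

    flow_order: the nucleotides flooded over the chip, in order.
    signals: the normalized voltage change recorded for each flow, where
             about 0 means no incorporation, about 1 means one base and
             about 2 means two identical bases (assumed normalization).
    """
    bases = []
    for nucleotide, signal in zip(flow_order, signals):
        # Round to the nearest whole number of incorporations; never negative.
        count = max(0, round(signal))
        bases.append(nucleotide * count)
    return "".join(bases)


flows = "TACGTACG"
signals = [1.02, 0.05, 1.96, 0.0, 0.1, 0.98, 0.03, 1.1]
print(decode_flows(flows, signals))
```

Here the third flow (C) produces roughly double the single-base signal and is therefore called as two consecutive C bases, while flows with a signal near zero contribute no base.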
[00400] In another embodiment, the present method comprises obtaining sequence information for the nucleic acids in the test sample, for example, cfDNA in a maternal test sample, using sequencing by hybridization. Sequencing by hybridization comprises contacting the plurality of polynucleotide sequences with a plurality of polynucleotide probes, wherein each of the plurality of polynucleotide probes can optionally be tethered to a substrate. The substrate may be a flat surface comprising an array of known nucleotide sequences. The pattern of hybridization to the array can be used to determine the polynucleotide sequences present in the sample. In other embodiments, each probe is tethered to a bead, for example, a magnetic bead or the like. Hybridization to the beads can be determined and used to identify the plurality of polynucleotide sequences within the sample.
[00401] In some embodiments of the methods described herein, the mapped sequence tags comprise sequence reads of about 20 base pairs, about 25 base pairs, about 30 base pairs, about 35 base pairs, about 40 base pairs, about 45 base pairs, about 50 base pairs, about 55 base pairs, about 60 base pairs, about 65 base pairs, about 70 base pairs, about 75 base pairs, about 80 base pairs, about 85 base pairs, about 90 base pairs, about 95 base pairs, about 100 base pairs, about 110 base pairs, about 120 base pairs, about 130 base pairs, about 140 base pairs, about 150 base pairs, about 200 base pairs, about 250 base pairs, about 300 base pairs, about 350 base pairs, about 400 base pairs, about 450 base pairs or about 500 base pairs. Technological advances are expected to enable single-end reads of greater than 500 base pairs, enabling reads of greater than about 1000 base pairs when paired-end reads are generated. In one embodiment, the mapped sequence tags comprise sequence reads that are 36 base pairs. Mapping of the sequence tags is achieved by comparing the sequence of the tag with the sequence of the reference to determine the chromosomal origin of the sequenced nucleic acid molecule (for example, cfDNA); specific genetic sequence information is not needed. A small degree of mismatch (0 to 2 mismatches per sequence tag) may be allowed to account for minor polymorphisms that may exist between the reference genome and the genomes in the mixed sample.
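As a non-limiting illustration of mapping a sequence tag to a reference while tolerating 0 to 2 mismatches, the following minimal sketch performs an exhaustive Hamming-distance comparison against a toy reference. The function names, the toy reference sequences and the choice to return the placement with the fewest mismatches are assumptions made for illustration; practical implementations use indexed aligners rather than a brute-force scan.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_tag(tag, reference, max_mismatches=2):
    """Return (chromosome, position, mismatches) for the best placement of a tag.

    reference: dict mapping a chromosome name to its sequence (toy data here).
    Returns None when no placement has max_mismatches or fewer mismatches.
    """
    hits = []
    for chrom, seq in reference.items():
        for pos in range(len(seq) - len(tag) + 1):
            mismatches = hamming(tag, seq[pos:pos + len(tag)])
            if mismatches <= max_mismatches:
                hits.append((mismatches, chrom, pos))
    if not hits:
        return None
    mismatches, chrom, pos = min(hits)  # fewest mismatches first
    return chrom, pos, mismatches


# Hypothetical miniature reference; real chromosomes are millions of bases long.
toy_reference = {"chr1": "ACGTACGTTAGC", "chr2": "TTTTGGGGCCCC"}
print(map_tag("ACGTACGA", toy_reference))
print(map_tag("AAAAAAAA", toy_reference))
```

Only the chromosomal origin of each tag is carried forward to the counting step, which is consistent with the statement above that specific genetic sequence information is not needed.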
[00402] A plurality of sequence tags are typically obtained per sample. In some embodiments, at least about 3 x 10^6 sequence tags, at least about 5 x 10^6 sequence tags, at least about 8 x 10^6 sequence tags, at least about 10 x 10^6 sequence tags, at least about 15 x 10^6 sequence tags, at least about 20 x 10^6 sequence tags, at least about 30 x 10^6 sequence tags, at least about 40 x 10^6 sequence tags or at least about 50 x 10^6 sequence tags comprising reads of between 20 and 40 base pairs, for example, 36 base pairs, are obtained from mapping the reads to the reference genome per sample. In one embodiment, all the sequence reads are mapped to all regions of the reference genome. In one embodiment, the tags that have been mapped to all regions, for example, all chromosomes, of the reference genome are counted, and the CNV, that is, the over- or under-representation of a sequence of interest, for example, a chromosome or portion thereof, in the mixed DNA sample is determined. The method does not require differentiation between the two genomes.
[00403] The accuracy required for correctly determining whether a CNV, for example, aneuploidy, is present or absent in a sample is predicated on the variation in the number of sequence tags that map to the reference genome among samples within a sequencing run (inter-chromosomal variability) and the variation in the number of sequence tags that map to the reference genome in different sequencing runs (inter-sequencing variability). For example, the variations can be particularly pronounced for tags that map to GC-rich or GC-poor reference sequences. Other variations can result from the use of different protocols for the extraction and purification of the nucleic acids, the preparation of the sequencing libraries and the use of different sequencing platforms. The present method uses sequence doses (chromosome doses or segment doses) based on the knowledge of normalizing sequences (normalizing chromosome sequences or normalizing segment sequences) to intrinsically account for the accrued variability stemming from inter-chromosomal (intra-run), inter-sequencing (inter-run) and platform-dependent variability. Chromosome doses are based on the knowledge of a normalizing chromosome sequence, which can be composed of a single chromosome or of two or more chromosomes selected from chromosomes 1-22, X and Y. Alternatively, normalizing chromosome sequences can be composed of a single chromosome segment or of two or more segments of one chromosome or of two or more chromosomes. Segment doses are based on the knowledge of a normalizing segment sequence, which can be composed of a single segment of any one chromosome or of two or more segments of any two or more of chromosomes 1-22, X and Y.
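By way of non-limiting illustration, a chromosome dose can be sketched as the number of sequence tags mapped to the chromosome of interest divided by the number of tags mapped to the normalizing chromosome sequence. The ratio form, the function names and the tag counts below are assumptions made for this sketch; this passage itself does not spell out the formula, and the choice of normalizing chromosomes shown is hypothetical.

```python
from collections import Counter


def count_tags(mapped_chromosomes):
    """Count the mapped sequence tags per chromosome."""
    return Counter(mapped_chromosomes)


def chromosome_dose(tag_counts, chrom_of_interest, normalizing_chroms):
    """Ratio of tags on the chromosome of interest to tags on the normalizing chromosome(s)."""
    denominator = sum(tag_counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("no tags mapped to the normalizing chromosome sequence")
    return tag_counts[chrom_of_interest] / denominator


# Toy data: each entry is the chromosome to which one tag was mapped.
mapped = ["chr21"] * 150 + ["chr9"] * 500 + ["chr11"] * 500
counts = count_tags(mapped)
dose = chromosome_dose(counts, "chr21", ["chr9", "chr11"])
print(round(dose, 3))
```

Because the numerator and the denominator are measured in the same sample and the same run, run-to-run and platform-dependent effects that scale both counts similarly tend to cancel in the ratio, which is the sense in which the dose accounts for such variability intrinsically.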
CNV and Prenatal Diagnostics
[00404] Cell-free fetal DNA and RNA circulating in maternal blood can be used for the early non-invasive prenatal diagnosis (NIPD) of an increasing number of genetic conditions, both for pregnancy management and to aid reproductive decision-making. The presence of cell-free DNA circulating in the bloodstream has been known for more than 50 years. More recently, the presence of small amounts of circulating fetal DNA was discovered in the maternal bloodstream during pregnancy (Lo et al., Lancet 350: 485-487 [1997]). Thought to originate from dying placental cells, cell-free fetal DNA (cfDNA) has been shown to consist of short fragments typically fewer than 200 base pairs in length (Chan et al., Clin Chem 50: 88-92 [2004]), which can be discerned as early as 4 weeks of gestation (Illanes et al., Early Human Dev 83: 563-566 [2007]) and are known to be cleared from the maternal circulation within hours of delivery (Lo et al., Am J Hum Genet 64: 218-224 [1999]). In addition to cfDNA, fragments of cell-free fetal RNA (cfRNA) can also be discerned in the maternal bloodstream, originating from genes that are transcribed in the fetus or placenta. The extraction and subsequent analysis of these fetal genetic elements from a maternal blood sample offers novel opportunities for NIPD.
[00405] The present method is a polymorphism-independent method for use in NIPD and does not require that the fetal cfDNA be distinguished from the maternal cfDNA to enable the determination of a fetal aneuploidy. In some embodiments, the aneuploidy is a complete chromosomal trisomy or monosomy, or a partial trisomy or monosomy. Partial aneuploidies are caused by the loss or gain of part of a chromosome and encompass chromosomal imbalances resulting from unbalanced translocations, unbalanced inversions, deletions and insertions. By far the most common known aneuploidy compatible with life is trisomy 21, that is, Down Syndrome (DS), which is caused by the presence of an extra copy of part or all of chromosome 21. Rarely, DS can be caused by an inherited or sporadic defect whereby an extra copy of all or part of chromosome 21 becomes attached to another chromosome (usually chromosome 14) to form a single aberrant chromosome. DS is associated with intellectual impairment, severe learning difficulties and excess mortality caused by long-term health problems such as heart disease. Other aneuploidies with known clinical significance include Edward syndrome (trisomy 18) and Patau syndrome (trisomy 13), which are frequently fatal within the first few months of life. Abnormalities associated with the number of sex chromosomes are also known and include monosomy X, for example, Turner syndrome (XO), and triple X syndrome (XXX) in female births and Klinefelter syndrome (XXY) and XYY syndrome in male births, all of which are associated with various phenotypes including sterility and reduced intellectual skills. Monosomy X [45,X] is a common cause of early pregnancy loss, accounting for about 7% of spontaneous abortions. Based on the liveborn frequency of 45,X (also called Turner syndrome) of 1-2/10,000, it is estimated that less than 1% of 45,X conceptuses will survive to term. About 30% of Turner syndrome patients are mosaic, with both a 45,X cell line and either a 46,XX cell line or one containing a rearranged X chromosome (Hook and Warburton 1983). The phenotype in a liveborn infant is relatively mild considering the high embryonic lethality, and it has been hypothesized that possibly all liveborn females with Turner syndrome carry a cell line containing two sex chromosomes. Monosomy X can occur in females as 45,X or as 45,X/46XX and in males as 45,X/46XY. Autosomal monosomies in humans are generally suggested to be incompatible with life; however, there are a number of cytogenetic reports describing full monosomy of one chromosome 21 in liveborn children (Vosranova J et al., Molecular Cytogen. 1:13 [2008]; Joosten et al., Prenatal Diagn. 17: 271-5 [1997]). The methods described herein can be used to diagnose these and other chromosomal abnormalities prenatally.
[00406] According to some embodiments, the methods described herein can determine the presence or absence of chromosomal trisomies of any one of chromosomes 1-22, X and Y. Examples of chromosomal trisomies that can be detected according to the present method include, without limitation, trisomy 21 (T21; Down syndrome), trisomy 18 (T18; Edward syndrome), trisomy 16 (T16), trisomy 20 (T20), trisomy 22 (T22; Cat Eye syndrome), trisomy 15 (T15; Prader-Willi syndrome), trisomy 13 (T13; Patau syndrome), trisomy 8 (T8; Warkany syndrome), trisomy 9 and the XXY (Klinefelter syndrome), XYY or XXX trisomies. Complete trisomies of other autosomes existing in a non-mosaic state are lethal, but can be compatible with life when present in a mosaic state. It will be appreciated that various complete trisomies, whether existing in a mosaic or non-mosaic state, and partial trisomies can be determined in fetal cfDNA according to the teachings provided herein.
[00407] Non-limiting examples of partial trisomies that can be determined by the present method include partial trisomy 1q32-44, trisomy 9p, trisomy 4 mosaicism, trisomy 17p, partial trisomy 4q26-qter, partial 2p trisomy, partial trisomy 1q and/or partial trisomy 6p/monosomy 6q.
[00408] The methods described herein can also be used to determine chromosomal monosomy X, chromosomal monosomy 21 and partial monosomies, such as monosomy 13, monosomy 15, monosomy 16, monosomy 21 and monosomy 22, which are known to be involved in pregnancy miscarriage. Partial monosomy of chromosomes typically involved in complete aneuploidy can also be determined by the method described herein. Non-limiting examples of deletion syndromes that can be determined according to the present method include syndromes caused by partial deletions of chromosomes. Examples of partial deletions that can be determined according to the methods described herein include, without limitation, partial deletions of chromosomes 1, 4, 5, 7, 11, 18, 15, 13, 17, 22 and 10, which are described below.
[00409] 1q21.1 deletion syndrome, or 1q21.1 (recurrent) microdeletion, is a rare aberration of chromosome 1. Alongside the deletion syndrome, there is also a 1q21.1 duplication syndrome. Whereas a part of the DNA is missing at a particular locus in the deletion syndrome, there are two or three copies of a similar part of the DNA at the same locus in the duplication syndrome. The literature refers to both the deletion and the duplication as the 1q21.1 copy number variations (CNV). The 1q21.1 deletion can be associated with TAR Syndrome (Thrombocytopenia with Absent Radius).
[00410] Wolf-Hirschhorn syndrome (WHS) (OMIM #194190) is a contiguous gene deletion syndrome associated with a hemizygous deletion of chromosome 4p16.3. Wolf-Hirschhorn syndrome is a congenital malformation syndrome characterized by pre- and postnatal growth deficiency, developmental disability of variable degree, characteristic craniofacial features ("Greek warrior helmet" appearance of the nose, high forehead, prominent glabella, hypertelorism, high-arched eyebrows, protruding eyes, epicanthal folds, short philtrum, distinct mouth with downturned corners and micrognathia) and a seizure disorder.
[00411] Partial deletion of chromosome 5, also known as 5p- or 5p minus and named Cri du Chat syndrome (OMIM #123450), is caused by a deletion of the short arm (p arm) of chromosome 5 (5p15.3-p15.2). Infants with this condition often have a high-pitched cry that sounds like that of a cat. The disorder is characterized by intellectual disability and delayed development, small head size (microcephaly), low birth weight and weak muscle tone (hypotonia) in infancy, distinctive facial features and possibly heart defects.
[00412] Williams-Beuren syndrome, also known as chromosome 7q11.23 deletion syndrome (OMIM 194050), is a contiguous gene deletion syndrome resulting in a multisystem disorder caused by hemizygous deletion of 1.5 to 1.8 Mb on chromosome 7q11.23, which contains approximately 28 genes.
[00413] Jacobsen syndrome, also known as 11q deletion disorder, is a rare congenital disorder resulting from the deletion of a terminal region of chromosome 11 that includes band 11q24.1. It can cause intellectual disabilities, a distinctive facial appearance and a variety of physical problems including heart defects and a bleeding disorder.
[00414] Partial monosomy of chromosome 18, known as monosomy 18p, is a rare chromosomal disorder in which all or part of the short arm (p) of chromosome 18 is deleted (monosomic). The disorder is typically characterized by short stature, variable degrees of mental retardation, speech delays, malformations of the skull and facial (craniofacial) region and/or additional physical abnormalities. The associated craniofacial defects may vary greatly in range and severity from case to case.
[00415] Conditions caused by changes in the structure or copy number of chromosome 15 include Angelman Syndrome and Prader-Willi Syndrome, which involve a loss of gene activity in the same part of chromosome 15, the 15q11-q13 region. It will be appreciated that several translocations and microdeletions can be asymptomatic in the carrier parent, yet can cause a major genetic disease in the offspring. For example, a healthy mother who carries the 15q11-q13 microdeletion can give birth to a child with Angelman syndrome, a severe neurodegenerative disorder. Thus, the methods, apparatus and systems described herein can be used to identify such a partial deletion and other deletions in the fetus.
[00416] Partial monosomy 13q is a rare chromosomal disorder that results when a piece of the long arm (q) of chromosome 13 is missing (monosomic). Infants born with partial monosomy 13q may exhibit low birth weight, malformations of the head and face (craniofacial region), skeletal abnormalities (especially of the hands and feet) and other physical abnormalities. Mental retardation is characteristic of this condition. The mortality rate during infancy is high among individuals born with this disorder. Almost all cases of partial monosomy 13q occur randomly, for no apparent reason (sporadic).
[00417] Smith-Magenis syndrome (SMS; OMIM #182290) is caused by a deletion, or loss of genetic material, on one copy of chromosome 17. This well-known syndrome is associated with developmental delay, mental retardation, congenital anomalies, such as heart and kidney defects, and neurobehavioral abnormalities, such as severe sleep disturbances and self-injurious behavior. Smith-Magenis syndrome is caused in most cases (90%) by a 3.7-Mb interstitial deletion in chromosome 17p11.2.
[00418] The 22q11.2 deletion syndrome, also known as DiGeorge syndrome, is a syndrome caused by the deletion of a small piece of chromosome 22. The deletion (22q11.2) occurs near the middle of the chromosome on the long arm of one of the pair of chromosomes. The features of this syndrome vary widely, even among members of the same family, and affect many parts of the body. Characteristic signs and symptoms may include birth defects, such as congenital heart disease, defects in the palate, most commonly related to neuromuscular problems with closure (velopharyngeal insufficiency), learning disabilities, mild differences in facial features and recurrent infections. Microdeletions in chromosomal region 22q11.2 are associated with a 20 to 30-fold increased risk of schizophrenia.
[00419] Deletions on the short arm of chromosome 10 are associated with a DiGeorge syndrome-like phenotype. Partial monosomy of chromosome 10p is rare, but has been observed in a portion of patients showing features of DiGeorge syndrome.
[00420] In one embodiment, the methods, apparatus and systems described herein are used to determine partial monosomies that include, but are not limited to, partial monosomy of chromosomes 1, 4, 5, 7, 11, 18, 15, 13, 17, 22 and 10, for example, partial monosomy 1q21.11, partial monosomy 4p16.3, partial monosomy 5p15.3-p15.2, partial monosomy 7q11.23, partial monosomy 11q24.1, partial monosomy 18p, partial monosomy of chromosome 15 (15q11-q13), partial monosomy 13q, partial monosomy 17p11.2, partial monosomy of chromosome 22 (22q11.2) and partial monosomy 10p.
[00421] Other partial monosomies that can be determined according to the methods described herein include unbalanced translocation t(8;11)(p23.2;p15.5); 11q23 microdeletion; 17p11.2 deletion; 22q13.3 deletion; Xp22.3 microdeletion; 10p14 deletion; 20p microdeletion, [del(22)(q11.2q11.23)], 7q11.23 and 7q36 deletions; 1p36 deletion; 2p microdeletion; neurofibromatosis type 1 (17q11.2 microdeletion), Yq deletion; 4p16.3 microdeletion; 1p36.2 microdeletion; 11q14 deletion; 19q13.2 microdeletion; Rubinstein-Taybi (16p13.3 microdeletion); 7p21 microdeletion; Miller-Dieker syndrome (17p13.3) and 2q37 microdeletion. Partial deletions can be small deletions of part of a chromosome, or they can be microdeletions of a chromosome in which the deletion of a single gene can occur.
[00422] Several duplication syndromes caused by the duplication of part of chromosome arms have been identified (see OMIM [Online Mendelian Inheritance in Man, viewed online at ncbi.nlm.nih.gov/omim]). In one embodiment, the present method can be used to determine the presence or absence of duplications and/or multiplications of segments of any one of chromosomes 1-22, X and Y. Non-limiting examples of duplication syndromes that can be determined according to the present method include duplications of part of chromosomes 8, 15, 12 and 17, which are described below.
[00423] 8p23.1 duplication syndrome is a rare genetic disorder caused by a duplication of a region of human chromosome 8. This duplication syndrome has an estimated prevalence of 1 in 64,000 births and is the reciprocal of the 8p23.1 deletion syndrome. The 8p23.1 duplication is associated with a variable phenotype including one or more of speech delay, developmental delay, mild dysmorphism, with prominent forehead and arched eyebrows, and congenital heart disease (CHD).
[00424] Chromosome 15q Duplication Syndrome (Dup15q) is a clinically identifiable syndrome that results from duplications of chromosome 15q11-13.1. Babies with Dup15q usually have hypotonia (poor muscle tone) and developmental delay; they may be born with a cleft lip and/or palate or malformations of the heart, kidneys or other organs; and they show some degree of cognitive delay/disability (mental retardation), speech and language delays and sensory processing disorders.
[00425] Pallister-Killian syndrome is a result of extra chromosome 12 material. There is usually a mixture of cells (mosaicism), some with extra chromosome 12 material and some that are normal (46 chromosomes without the extra chromosome 12 material). Babies with this syndrome have many problems including severe mental retardation, poor muscle tone, "coarse" facial features and a prominent forehead. They tend to have a very thin upper lip with a thicker lower lip and a short nose. Other health problems can include seizures, poor feeding, stiff joints, cataracts in adulthood, hearing loss and heart defects. People with Pallister-Killian syndrome have a shortened lifespan.
[00426] Individuals with the genetic condition designated dup(17)(p11.2p11.2) or dup 17p carry extra genetic information (known as a duplication) on the short arm of chromosome 17. Duplication of chromosome 17p11.2 underlies Potocki-Lupski syndrome (PTLS), which is a recently recognized genetic condition with only a few dozen cases reported in the medical literature. Patients who have this duplication often have low muscle tone, poor feeding and failure to thrive during infancy, and also present with delayed development of motor and verbal milestones. Many individuals who have PTLS have difficulty with articulation and language processing. In addition, patients may have behavioral characteristics similar to those seen in persons with autism or autism-spectrum disorders. Individuals with PTLS may have heart defects and sleep apnea. A duplication of a large region in chromosome 17p12 that includes the PMP22 gene is known to cause Charcot-Marie-Tooth disease.
[00427] CNV have been associated with stillbirths. However, due to the inherent limitations of conventional cytogenetics, the contribution of CNV to stillbirth is thought to be underrepresented (Harris et al., Prenatal Diagn 31: 932-944 [2011]). As shown in the examples and described elsewhere herein, the present method is capable of determining the presence of partial aneuploidies, for example, deletions and multiplications of chromosome segments, and can be used to identify and determine the presence or absence of CNV that are associated with stillbirths.
Determination of CNV for clinical disorders
[00428] In addition to the early determination of birth defects, the methods described herein can be applied to the determination of any abnormality in the representation of genetic sequences within the genome. A number of abnormalities in the representation of genetic sequences within the genome have been associated with various pathologies. Such pathologies include, but are not limited to, cancer, infectious and autoimmune diseases, diseases of the nervous system, metabolic and/or cardiovascular diseases and the like.
[00429] Accordingly, in various embodiments, use of the methods described herein in the diagnosis and/or monitoring and/or treatment of such pathologies is contemplated. For example, the methods can be applied to determine the presence or absence of a disease, to monitor the progression of a disease and/or the efficacy of a treatment regimen, to determine the presence or absence of nucleic acids of a pathogen, for example, a virus; to determine chromosomal abnormalities associated with graft versus host disease (GVHD) and to determine the contribution of individuals in forensic analyses.
Cancer CNVs
[00430] Blood plasma and serum DNA from cancer patients have been shown to contain measurable quantities of tumor DNA, which can be recovered and used as a surrogate source of tumor DNA, and tumors are characterized by aneuploidy, or inappropriate numbers of gene sequences or even entire chromosomes. The determination of a difference in the amount of a given sequence, that is, a sequence of interest, in a sample from an individual can thus be used in the prognosis or diagnosis of a medical condition. In some embodiments, the present method can be used to determine the presence or absence of a chromosomal aneuploidy in a patient suspected of or known to be suffering from cancer.
[00431] Some implementations herein provide methods for detecting cancer and for tracking therapeutic response and minimal residual disease based on circulating cfDNA samples, using shallow sequencing of samples with paired-end methodology and using the fragment size information available from paired-end reads to identify the presence of differentially methylated apoptotic DNA from cancer cells in a background of normal cells. Tumor-derived cfDNA has been shown to be shorter than non-tumor-derived cfDNA in some cancers. Therefore, the size-based method described herein can be used to determine CNV, including the aneuploidies associated with these cancers, allowing (a) detection of a tumor present in a screening or diagnostic setting; (b) monitoring of response to therapies; and (c) monitoring of minimal residual disease.
[00432] In certain embodiments, the aneuploidy is characteristic of the genome of the individual and results in a generally increased predisposition to cancer. In certain embodiments, the aneuploidy is characteristic of particular cells (for example, tumor cells, proto-tumor neoplastic cells, etc.) that are neoplastic or have an increased predisposition to neoplasia. Particular aneuploidies are associated with particular cancers or predispositions to particular cancers as described below. In some embodiments, a very shallow paired-end sequencing method can be used to detect/monitor the presence of cancer in a cost-effective manner.
[00433] Accordingly, various embodiments of the methods described herein provide a determination of copy number variation of sequence(s) of interest, for example, clinically relevant sequences, in a test sample from an individual, where certain variations in copy number provide an indicator of the presence of and/or a predisposition to cancer. In certain embodiments, the sample comprises a mixture of nucleic acids derived from two or more types of cells. In one embodiment, the mixture of nucleic acids is derived from normal and cancerous cells derived from an individual suffering from a medical condition, for example, cancer.
[00434] The development of cancer is often accompanied by an alteration in the number of whole chromosomes, that is, complete chromosomal aneuploidy, and/or an alteration in the number of segments of chromosomes, that is, partial aneuploidy, caused by a process known as chromosome instability (CIN) (Thoma et al., Swiss Med Weekly 2011: 141: w13170). Many solid tumors, such as breast cancer, are believed to progress from initiation to metastasis through the accumulation of several genetic aberrations (Sato et al., Cancer Res., 50: 7184-7189 [1990]; Jongsma et al., J Clin Pathol: Mol Path 55: 305-309 [2002]). Such genetic aberrations, as they accumulate, may confer proliferative advantages, genetic instability and the attendant ability to evolve drug resistance rapidly, as well as enhanced angiogenesis, proteolysis and metastasis. The genetic aberrations may affect either recessive "tumor suppressor genes" or dominantly acting oncogenes. Deletions and recombination leading to loss of heterozygosity (LOH) are believed to play a major role in tumor progression by uncovering mutated tumor suppressor alleles.
[00435] cfDNA has been found in the circulation of patients diagnosed with malignancies including, but not limited to, lung cancer (Pathak et al. Clin Chem 52: 1833-1842 [2006]), prostate cancer (Schwartzenbach et al. Clin Cancer Res 15: 1032-8 [2009]) and breast cancer (Schwartzenbach et al. available online at breast-cancer-research.com/content/11/5/R71 [2009]). The identification of cancer-associated genomic instabilities that can be determined in the circulating cfDNA of cancer patients is a potential diagnostic and prognostic tool. In one embodiment, the methods described here are used to determine CNV of one or more sequence(s) of interest in a sample, for example, a sample comprising a mixture of nucleic acids derived from an individual who is suspected or known to have cancer, for example, carcinoma, sarcoma, lymphoma, leukemia, germ cell tumors and blastoma. In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood that can comprise a mixture of cfDNA derived from normal and cancer cells. In another embodiment, the biological sample used to determine whether a CNV is present is derived from cells that, if a cancer is present, comprise a mixture of cancerous and non-cancerous cells from other biological tissues including, but not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts and leukophoresis samples, or from tissue biopsies, swabs or smears. In other embodiments, the biological sample is a stool (fecal) sample.
[00436] The methods described are not limited to the analysis of cfDNA. It will be recognized that similar analyses can be performed on cellular DNA samples.
[00437] In various embodiments, the sequence(s) of interest comprise nucleic acid sequence(s) known or suspected to play a role in the development and/or progression of cancer. Examples of a sequence of interest include nucleic acid sequences, for example, whole chromosomes and/or chromosome segments, that are amplified or deleted in cancer cells as described below.
Total CNV number and cancer risk.
[00438] Common cancer SNPs, and by analogy common cancer CNVs, may each confer only a minor increase in disease risk. Collectively, however, they can cause a substantially elevated risk of cancer. In this regard, it is noted that germline gains and losses of large DNA segments have been reported as factors predisposing individuals to neuroblastoma, prostate and colorectal cancer, breast cancer and BRCA1-associated ovarian cancer (see, for example, Krepischi et al. Breast Cancer Res., 14: R24 [2012]; Diskin et al. Nature 2009, 459: 987-991; Liu et al. Cancer Res 2009, 69: 2176-2179; Lucito et al. Cancer Biol Ther 2007, 6: 1592-1599; Thean et al. Genes Chromosomes Cancer 2010, 49: 99-106; Venkatachalam et al. Int J Cancer 2011, 129: 1635-1642 and Yoshihara et al. Genes Chromosomes Cancer 2011, 50: 167-177). It is noted that CNVs frequently found in the healthy population (common CNVs) are believed to have a role in cancer etiology (see, for example, Shlien and Malkin (2009) Genome Medicine, 1 (6): 62). In one study testing the hypothesis that common CNVs are associated with malignancy (Shlien et al. Proc Natl Acad Sci USA 2008, 105: 11264-11269), a map was created of every known CNV whose locus coincides with that of bona fide cancer-related genes (as cataloged by Higgins et al. Nucleic Acids Res 2007, 35: D721-726). These were termed "cancer CNVs". In an initial analysis (Shlien et al. Proc Natl Acad Sci USA 2008, 105: 11264-11269), 770 healthy genomes were evaluated using the Affymetrix 500K array set, which has an average inter-probe distance of 5.8 kb. Because CNVs are generally thought to be depleted in gene regions (Redon et al. (2006) Nature 2006, 444: 444-454), it was surprising to find 49 cancer genes that were directly encompassed or overlapped by a CNV in more than one person in a large reference population. In the top ten genes, cancer CNVs were found in four or more people.
[00439] It is therefore believed that CNV frequency can be used as a measure of cancer risk (see, for example, U.S. Patent Publication No. 2010/0261183 A1). The CNV frequency can be determined simply from the constitutional genome of the organism, or it can represent a fraction derived from one or more tumors (neoplastic cells), if present.
[00440] In certain embodiments, a number of CNVs in a test sample (for example, a sample comprising a constitutional (germline) nucleic acid) or in a mixture of nucleic acids (for example, a germline nucleic acid and nucleic acid(s) derived from neoplastic cells) is determined using the methods described here for copy number variations. The identification of an increased number of CNVs in the test sample, for example, compared to a reference value, is indicative of a risk of or predisposition to cancer in the individual. It will be appreciated that the reference value may vary with a given population. It will also be appreciated that the absolute value of the increase in CNV frequency will vary depending on the resolution of the method used to determine CNV frequency and on other parameters. Typically, an increase in CNV frequency of at least about 1.2 times the reference value has been determined to be indicative of cancer risk (see, for example, U.S. Patent Publication No. 2010/0261183 A1); for example, an increase in CNV frequency of at least about 1.5 times the reference value or greater, such as 2 to 4 times the reference value, is an indicator of an increased risk of cancer (for example, compared to the normal healthy reference population).
[00441] A determination of structural variation in the genome of a mammal, compared to a reference value, is also believed to be indicative of cancer risk. In this context, in one embodiment, the term "structural variation" can be defined as the CNV frequency in a mammal multiplied by the average CNV size (in base pairs) in the mammal. Thus, high structural variation scores will result from an increased CNV frequency and/or from the occurrence of large genomic nucleic acid deletions or duplications.
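The fold-increase criterion of paragraph [00440] can be sketched as a simple ratio test. The function names are hypothetical, and treating "CNV frequency" as a per-sample CNV count compared against a single reference count is an assumption made for illustration; as the text notes, the reference value depends on the population and on the resolution of the method.

```python
def cnv_frequency_ratio(sample_cnv_count, reference_cnv_count):
    """Ratio of the CNV frequency in a test sample to a reference value."""
    if reference_cnv_count <= 0:
        raise ValueError("reference value must be positive")
    return sample_cnv_count / reference_cnv_count


def indicates_increased_risk(sample_cnv_count, reference_cnv_count, min_fold=1.2):
    """True if the sample reaches the fold increase over the reference.

    min_fold defaults to the 'at least about 1.2 times' figure in the text;
    1.5 or a value between 2 and 4 could be passed instead.
    """
    return cnv_frequency_ratio(sample_cnv_count, reference_cnv_count) >= min_fold


print(indicates_increased_risk(75, 50))  # True: 1.5-fold the reference
print(indicates_increased_risk(55, 50))  # False: 1.1-fold the reference
```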
Consequently, in certain embodiments, a number of CNVs in a test sample (for example, a sample comprising a constitutional (germline) nucleic acid) is determined using the methods described here to determine the size and number of copy number variations. In certain embodiments, a total structural variation score within the genomic DNA of greater than about 1 megabase, or greater than about 1.1 megabases, or greater than about 1.2 megabases, or greater than about 1.3 megabases, or greater than about 1.4 megabases, or greater than about 1.5 megabases, or greater than about 1.8 megabases, or greater than about 2 megabases of DNA is indicative of cancer risk.
[00442] These methods are believed to provide a measure of the risk of any cancer including, but not limited to, acute and chronic leukemias, lymphomas, numerous solid tumors of mesenchymal or epithelial tissue, brain, breast, liver, stomach and colon cancer, B cell lymphoma, lung cancer, bronchial cancer, colorectal cancer, prostate cancer, breast cancer, pancreatic cancer, stomach cancer, ovarian cancer, urinary bladder cancer, brain or central nervous system cancer, peripheral nervous system cancer, esophageal cancer, cervical cancer, melanoma, uterine or endometrial cancer, cancer of the oral cavity or pharynx, liver cancer, kidney cancer, biliary tract cancer, small bowel or appendix cancer, salivary gland cancer, thyroid gland cancer, adrenal gland cancer, osteosarcoma, chondrosarcoma, liposarcoma, testicular cancer, malignant fibrous histiocytoma and other cancers.
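The structural variation score defined in paragraph [00441] (CNV frequency multiplied by average CNV size) can be sketched as follows. The function names are hypothetical, and taking the CNV frequency to be the number of CNVs called in the sample is an assumption. Under that reading the score equals the total number of base pairs covered by the called CNVs, which is then compared with one of the thresholds listed above (about 1 megabase by default here).

```python
def structural_variation_score(cnv_sizes_bp):
    """CNV frequency (count) times mean CNV size, in base pairs."""
    count = len(cnv_sizes_bp)
    if count == 0:
        return 0.0
    mean_size = sum(cnv_sizes_bp) / count
    return count * mean_size


def exceeds_threshold(cnv_sizes_bp, threshold_bp=1_000_000):
    """True if the score is greater than the chosen threshold."""
    return structural_variation_score(cnv_sizes_bp) > threshold_bp


# Three CNVs with a mean size of 400 kb give a score of 1.2 Mb.
sizes = [300_000, 400_000, 500_000]
print(structural_variation_score(sizes))  # 1200000.0
print(exceeds_threshold(sizes))           # True
```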
Whole-chromosome aneuploidies.
[00443] As indicated above, there is a high frequency of aneuploidy in cancer. In some studies examining the prevalence of somatic copy number alterations (SCNAs) in cancer, it has been found that one quarter of the genome of a typical cancer cell is affected by either whole-arm SCNAs or whole-chromosome SCNAs of aneuploidy (see, for example, Beroukhim et al. Nature 463: 899-905 [2010]). Whole-chromosome alterations are recurrently observed in several cancer types. For example, gain of chromosome 8 is seen in 10 to 20% of cases of acute myeloid leukemia (AML), as well as in some solid tumors, including Ewing's sarcoma and desmoid tumors (see, for example, Barnard et al. Leukemia 10: 5-12 [1996]; Maurici et al. Cancer Genet. Cytogenet. 100: 106-110 [1998]; Qi et al. Cancer Genet. Cytogenet. 92: 147-149 [1996]; Barnard, DR et al. Blood 100: 427-434 [2002]; and the like). An illustrative but non-limiting list of chromosome gains and losses in human cancers is shown in Table 2.
TABLE 2. Illustrative specific, recurrent chromosome gains and losses in human cancer (see, for example, Gordon et al. (2012) Nature Rev. Genetics, 13: 189-203).
[00444] In various embodiments, the methods described here can be used to detect and/or quantify whole-chromosome aneuploidies that are associated with cancer in general and/or with particular cancers. Thus, for example, in certain embodiments, detection and/or quantification of whole-chromosome aneuploidies characterized by the gains or losses shown in Table 2 is contemplated.
Arm-level chromosomal segment copy number variations.
[00445] Multiple studies have reported patterns of arm-level copy number variation across large numbers of cancer specimens (Lin et al. Cancer Res 68, 664-673 (2008); George et al. PLoS ONE 2, e255 (2007); Demichelis et al. Genes Chromosomes Cancer 48: 366-380 (2009); Beroukhim et al. Nature. 463 (7283): 899-905 [2010]). Additionally, it has been observed that the frequency of arm-level copy number variations decreases with the length of the chromosome arms. Adjusted for this trend, most chromosome arms show strong evidence of preferential gain or loss, but rarely both, across multiple cancer lineages (see, for example, Beroukhim et al. Nature. 463 (7283): 899-905 [2010]).
[00446] Consequently, in one embodiment, the methods described here are used to determine arm-level CNVs (CNVs comprising one chromosome arm or substantially one chromosome arm) in a sample. The CNVs can be determined in a test sample comprising a constitutional (germline) nucleic acid, and the arm-level CNVs can be identified in those constitutional nucleic acids. In certain embodiments, arm-level CNVs are identified (if present) in a sample comprising a mixture of nucleic acids (for example, nucleic acids derived from normal cells and nucleic acids derived from neoplastic cells). In certain embodiments, the sample is derived from an individual who is suspected or known to have cancer, for example, carcinoma, sarcoma, lymphoma, leukemia, germ cell tumors, blastoma and the like. In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood that can comprise a mixture of cfDNA derived from normal and cancer cells. In another embodiment, the biological sample used to determine whether a CNV is present is derived from cells that, if a cancer is present, comprise a mixture of cancerous and non-cancerous cells from other biological tissues including, but not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts and leukophoresis samples, or from tissue biopsies, swabs or smears. In other embodiments, the biological sample is a stool (fecal) sample.
[00447] In various embodiments, the CNVs identified as indicative of the presence of a cancer or an increased risk of a cancer include, but are not limited to, the arm-level CNVs listed in Table 3. As illustrated in Table 3, certain CNVs that comprise a substantial arm-level gain are indicative of the presence of a cancer or an increased risk of certain cancers. Thus, for example, a gain in 1q is indicative of the presence or increased risk of acute lymphoblastic leukemia (ALL), breast cancer, GIST, HCC, lung NSC, medulloblastoma, melanoma, MPD, ovarian cancer and/or prostate cancer. A gain in 3q is indicative of the presence or increased risk of esophageal squamous cancer, lung SC and/or MPD. A gain in 7q is indicative of the presence or increased risk of colorectal cancer, glioma, HCC, lung NSC, medulloblastoma, melanoma, prostate cancer and/or kidney cancer. A gain in 7q is indicative of the presence or increased risk of breast cancer, colorectal cancer, esophageal adenocarcinoma, glioma, HCC, lung NSC, medulloblastoma, melanoma and/or kidney cancer. A gain in 20q is indicative of the presence or increased risk of breast cancer, colorectal cancer, dedifferentiated liposarcoma, esophageal adenocarcinoma, esophageal squamous cancer, glioma, HCC, lung NSC, melanoma, ovarian cancer and/or kidney cancer, and so on.
[00448] Similarly, as illustrated in Table 3, certain CNVs that comprise a substantial arm-level loss are indicative of the presence of and/or an increased risk of certain cancers. Thus, for example, a loss in 1p is indicative of the presence or increased risk of gastrointestinal stromal tumor. A loss in 4q is indicative of the presence or increased risk of colorectal cancer, esophageal adenocarcinoma, lung SC, melanoma, ovarian cancer and/or kidney cancer. A loss in 17p is indicative of the presence or increased risk of breast cancer, colorectal cancer, esophageal adenocarcinoma, HCC, lung NSC, lung SC and/or ovarian cancer, and the like.
TABLE 3. Significant arm-level chromosomal segment copy number alterations in each of 16 cancer subtypes (breast, colorectal, dedifferentiated liposarcoma, esophageal adenocarcinoma, esophageal squamous, GIST (gastrointestinal stromal tumor), glioma, HCC (hepatocellular carcinoma), lung NSC, lung SC, medulloblastoma, melanoma, MPD (myeloproliferative disease), ovarian, prostate, acute lymphoblastic leukemia (ALL) and renal) (see Beroukhim et al. Nature (2010) 463 (7283): 899-905).
[00449] The examples of associations between arm-level copy number variations and cancers are intended to be illustrative and not limiting. Other arm-level copy number variations and their associations with cancers are known to those of skill in the art.
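One way such arm-level associations might be consulted programmatically is a lookup keyed by chromosome arm and direction of change. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the entries transcribe only a few of the associations stated in paragraphs [00447] and [00448], not the full content of Table 3.

```python
# Partial, illustrative transcription of associations stated in the text.
ARM_LEVEL_ASSOCIATIONS = {
    ("1q", "gain"): [
        "ALL", "breast cancer", "GIST", "HCC", "lung NSC",
        "medulloblastoma", "melanoma", "MPD", "ovarian cancer",
        "prostate cancer",
    ],
    ("4q", "loss"): [
        "colorectal cancer", "esophageal adenocarcinoma", "lung SC",
        "melanoma", "ovarian cancer", "kidney cancer",
    ],
    ("17p", "loss"): [
        "breast cancer", "colorectal cancer", "esophageal adenocarcinoma",
        "HCC", "lung NSC", "lung SC", "ovarian cancer",
    ],
}


def cancers_for(arm, change):
    """Return cancers associated with an arm-level gain or loss, if listed."""
    return ARM_LEVEL_ASSOCIATIONS.get((arm, change), [])


print(cancers_for("17p", "loss"))
print(cancers_for("2p", "gain"))  # [] because this arm is not in the sketch
```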
Smaller, for example, focal, copy number variations.
[00450] As indicated above, in certain embodiments, the methods described here can be used to determine the presence or absence of a chromosomal amplification. In some embodiments, the chromosomal amplification is the gain of one or more whole chromosomes. In other embodiments, the chromosomal amplification is the gain of one or more segments of a chromosome. In still other embodiments, the chromosomal amplification is the gain of two or more segments of two or more chromosomes. In various embodiments, the chromosomal amplification may involve the gain of one or more oncogenes.
[00451] Dominantly acting genes associated with human solid tumors typically exert their effect by overexpression or altered expression. Gene amplification is a common mechanism leading to upregulation of gene expression. Evidence from cytogenetic studies indicates that significant amplification occurs in more than 50% of human breast cancers. Most notably, amplification of the proto-oncogene human epidermal growth factor receptor 2 (HER2), located on chromosome 17 (17q21-q22), results in overexpression of HER2 receptors on the cell surface, leading to excessive and dysregulated signaling in breast cancer and other malignancies (Park et al., Clinical Breast Cancer 8: 392-401 [2008]). A variety of oncogenes have been found to be amplified in other human malignancies. Examples of the amplification of cellular oncogenes in human tumors include amplifications of: c-myc in the promyelocytic leukemia cell line HL60 and in small cell lung carcinoma cell lines; N-myc in primary neuroblastomas (stages III and IV), neuroblastoma cell lines, a retinoblastoma cell line and primary tumors, and small cell lung carcinoma lines and tumors; L-myc in small cell lung carcinoma cell lines and tumors; c-myb in acute myeloid leukemia and in colon carcinoma cell lines; c-erbB in squamous cell carcinoma and primary gliomas; c-K-ras-2 in primary carcinomas of the lung, colon, bladder and rectum; and N-ras in a mammary carcinoma cell line (Varmus H., Ann Rev Genetics 18: 553-612 (1984) [cited in Watson et al., Molecular Biology of the Gene (4th ed.; Benjamin/Cummings Publishing Co. 1987)]).
[00452] Duplications of oncogenes are a common cause of many types of cancer, as is the case with P70-S6 Kinase 1 amplification and breast cancer. In such cases, the genetic duplication occurs in a somatic cell and affects only the genome of the cancer cells themselves, not the entire organism, much less any subsequent offspring. Other examples of oncogenes that are amplified in human cancers include MYC, ERBB2 (EFGR), CCND1 (Cyclin D1), FGFR1 and FGFR2 in breast cancer; MYC and ERBB2 in cervical cancer; HRAS, KRAS and MYB in colorectal cancer; MYC, CCND1 and MDM2 in esophageal cancer; CCNE, KRAS and MET in gastric cancer; ERBB1 and CDK4 in glioblastoma; CCND1, ERBB1 and MYC in head and neck cancer; CCND1 in hepatocellular cancer; MYCB in neuroblastoma; MYC, ERBB2 and AKT2 in ovarian cancer; MDM2 and CDK4 in sarcoma; and MYC in small cell lung cancer. In one embodiment, the present method can be used to determine the presence or absence of amplification of an oncogene associated with a cancer. In some embodiments, the amplified oncogene is associated with breast cancer, cervical cancer, colorectal cancer, esophageal cancer, gastric cancer, glioblastoma, head and neck cancer, hepatocellular cancer, neuroblastoma, ovarian cancer, sarcoma or small cell lung cancer.
[00453] In one embodiment, the present method can be used to determine the presence or absence of a chromosomal deletion. In some embodiments, the chromosomal deletion is the loss of one or more entire chromosomes. In other embodiments, the chromosomal deletion is the loss of one or more segments of a chromosome. In still other embodiments, the chromosomal deletion is the loss of two or more segments of two or more chromosomes. The chromosomal deletion may involve the loss of one or more tumor suppressor genes.
[00454] Chromosomal deletions involving tumor suppressor genes are believed to play an important role in the development and progression of solid tumors. The retinoblastoma tumor suppressor gene (Rb-1), located at chromosome 13q14, is the most extensively characterized tumor suppressor gene. The Rb-1 gene product, a 105 kDa nuclear phosphoprotein, apparently plays an important role in cell cycle regulation (Howe et al., Proc Natl Acad Sci (USA) 87: 5883-5887 [1990]). Altered or lost expression of the Rb protein is caused by inactivation of both gene alleles through either a point mutation or a chromosomal deletion. Alterations in the Rb-1 gene have been found to be present not only in retinoblastomas but also in other malignancies such as osteosarcomas, small cell lung cancer (Rygaard et al., Cancer Res 50: 5312-5317 [1990]) and breast cancer. Restriction fragment length polymorphism (RFLP) studies have indicated that such tumor types frequently lose heterozygosity at 13q, suggesting that one of the Rb-1 gene alleles has been lost due to a gross chromosomal deletion (Bowcock et al., Am J Hum Genet, 46: 12 [1990]). Chromosome 1 abnormalities, including duplications, deletions and unbalanced translocations involving chromosome 6 and other partner chromosomes, indicate that regions of chromosome 1, in particular 1q21-1q32 and 1p11-13, might harbor oncogenes or tumor suppressor genes that are pathogenetically relevant to both the chronic and advanced phases of myeloproliferative neoplasms (Caramazza et al., Eur J Hematol 84: 191-200 [2010]). Myeloproliferative neoplasms are also associated with deletions of chromosome 5. Complete loss or interstitial deletions of chromosome 5 are the most common karyotypic abnormality in myelodysplastic syndromes (MDSs). Patients with isolated del(5q)/5q- MDS have a more favorable prognosis than those with additional karyotypic defects, who tend to develop myeloproliferative neoplasms (MPNs) and acute myeloid leukemia.
The frequency of unbalanced chromosome 5 deletions has led to the idea that 5q harbors one or more tumor suppressor genes that have fundamental roles in the growth control of hematopoietic stem/progenitor cells (HSCs/HPCs). Cytogenetic mapping of commonly deleted regions (CDRs) centered on 5q31 and 5q32 has identified candidate tumor suppressor genes, including the ribosomal subunit RPS14, the transcription factor Egr1/Krox20 and the cytoskeletal remodeling protein alpha-catenin (Eisenmann et al., Oncogene 28: 3429-3441 [2009]). Cytogenetic and allelotyping studies of fresh tumors and tumor cell lines have shown that allelic loss from several distinct regions on chromosome 3p, including 3p25, 3p21-22, 3p21.3, 3p12-13 and 3p14, is the earliest and most frequent genomic abnormality involved in a broad spectrum of major epithelial cancers of the lung, breast, kidney, head and neck, ovary, cervix, colon, pancreas, esophagus, bladder and other organs. Several tumor suppressor genes have been mapped to the chromosome 3p region, and interstitial deletions or promoter hypermethylation are thought to precede the loss of 3p or of the entire chromosome 3 in the development of carcinomas (Angeloni D., Briefings Functional Genomics 6: 19-39 [2007]).
[00455] Newborns and children with Down syndrome (DS) often present with congenital transient leukemia and have an increased risk of acute myeloid leukemia and acute lymphoblastic leukemia. Chromosome 21, harboring about 300 genes, may be involved in numerous structural aberrations, for example, translocations, deletions and amplifications, in leukemias, lymphomas and solid tumors. Moreover, genes located on chromosome 21 have been identified that play an important role in tumorigenesis. Somatic numerical as well as structural chromosome 21 aberrations are associated with leukemias, and specific genes including RUNX1, TMPRSS2 and TFF, which are located at 21q, play a role in tumorigenesis (Fonatsch C Gene Chromosomes Cancer 49: 497-508 [2010]).
[00456] In view of the foregoing, in various embodiments, the methods described here can be used to determine CNVs of segments that are known to comprise one or more oncogenes or tumor suppressor genes, and/or that are known to be associated with a cancer or an increased risk of cancer. In certain embodiments, the CNVs can be determined in a test sample comprising a constitutional (germline) nucleic acid, and the segment can be identified in those constitutional nucleic acids. In certain embodiments, segment CNVs are identified (if present) in a sample comprising a mixture of nucleic acids (for example, nucleic acids derived from normal cells and nucleic acids derived from neoplastic cells). In certain embodiments, the sample is derived from a subject who is suspected or known to have cancer, for example, carcinoma, sarcoma, lymphoma, leukemia, germ cell tumors, blastoma and the like. In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood that can comprise a mixture of cfDNA derived from normal and cancer cells. In another embodiment, the biological sample used to determine whether a CNV is present is derived from cells that, if a cancer is present, comprise a mixture of cancerous and non-cancerous cells from other biological tissues including, but not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts and leukophoresis samples, or from tissue biopsies, swabs or smears. In other embodiments, the biological sample is a stool (fecal) sample.
[00457] The CNVs used to determine the presence of a cancer and/or an increased risk of a cancer may comprise amplifications or deletions.
[00458] In various embodiments, the CNVs identified as indicative of the presence of a cancer or an increased risk of a cancer include one or more of the amplifications shown in Table 4.
TABLE 4. Illustrative, but not limiting, chromosomal segments characterized by amplifications that are associated with cancers. The types of cancer listed are those identified in Beroukhim et al. Nature 18: 463: 899-905.
[00459] In certain embodiments, in combination with the amplifications described above (herein) or separately, the CNVs identified as indicative of the presence of a cancer or an increased risk of a cancer include one or more of the deletions shown in Table 5.
TABLE 5. Illustrative, but not limiting, chromosomal segments characterized by deletions that are associated with cancers. The types of cancer listed are those identified in Beroukhim et al. Nature 18: 463: 899-905.
[00460] The aneuploidies identified as characteristic of various cancers (for example, the aneuploidies identified in Tables 4 and 5) may contain genes known to be implicated in cancer etiologies (for example, tumor suppressors, oncogenes, etc.). These aneuploidies can also be probed to identify relevant but previously unknown genes.
[00461] For example, Beroukhim et al., supra, assessed potential cancer-causing genes in the copy number alterations using GRAIL (Gene Relationships Among Implicated Loci), an algorithm that searches for functional relationships among genomic regions. GRAIL scores each gene in a collection of genomic regions for its "relatedness" to genes in other regions, based on textual similarity between the published abstracts of all papers citing the genes, on the notion that some target genes will function in common pathways. These methods permit identification/characterization of genes not previously associated with the particular cancers at issue. Table 6 illustrates target genes known to be within the identified amplified segments, together with predicted genes, and Table 7 illustrates target genes known to be within the identified deleted segments, together with predicted genes.
TABLE 6. Illustrative but not limiting chromosomal segments and genes known or predicted to be present in regions characterized by amplification in various cancers (see, for example, Beroukhim et al., supra).
TABLE 7. Illustrative but not limiting chromosomal segments and genes known or predicted to be present in regions characterized by deletion in various cancers (see, for example, Beroukhim et al., supra).
[00462] In various embodiments, use of the methods described here is contemplated to identify CNVs of segments comprising the amplified regions or genes identified in Table 6 and/or to identify CNVs of segments comprising the deleted regions or genes identified in Table 7.
[00463] In one embodiment, the methods described here provide a means to assess the association between gene amplification and the extent of tumor evolution. Correlation between amplification and/or deletion and the stage or grade of a cancer can be prognostically important, because such information can contribute to the definition of a genetically based tumor grade that would better predict the future course of the disease, with more advanced tumors having the worst prognosis. In addition, information about early amplification and/or deletion events can be useful in associating those events as predictors of subsequent disease progression.
[00464] Gene amplifications and deletions identified by the method can be associated with other known parameters such as tumor grade, histology, Brd/Urd labeling index, hormonal status, nodal involvement, tumor size, survival duration and other tumor properties available from epidemiological and biostatistical studies. For example, the tumor DNA to be tested by the method could include atypical hyperplasia, ductal carcinoma in situ, stage I-III cancer and metastatic lymph nodes, in order to permit the identification of associations between amplifications and deletions and stage. The associations made can make effective therapeutic intervention possible. For example, consistently amplified regions may contain an overexpressed gene, the product of which may be amenable to therapeutic attack (for example, the growth factor receptor tyrosine kinase, p185HER2).
[00465] In various embodiments, the methods described here can be used to identify amplification and/or deletion events that are associated with drug resistance by comparing the copy number variation of nucleic acid sequences from primary cancers with that of cells that have metastasized to other sites. If gene amplification and/or deletion is a manifestation of karyotypic instability that allows rapid development of drug resistance, more amplification and/or deletion would be expected in primary tumors from chemoresistant patients than in tumors from chemosensitive patients. For example, if amplification of specific genes is responsible for the development of drug resistance, regions surrounding those genes would be expected to be amplified consistently in tumor cells from pleural effusions of chemoresistant patients, but not in primary tumors. The discovery of associations between gene amplification and/or deletion and the development of drug resistance may allow the identification of patients who will or will not benefit from adjuvant therapy.
[00466] In a manner similar to that described for determining the presence or absence of complete and/or partial fetal chromosomal aneuploidies in a maternal sample, the methods, apparatus and systems described here can be used to determine the presence or absence of complete and/or partial chromosomal aneuploidies in any patient sample comprising nucleic acids, for example, DNA or cfDNA (including patient samples that are not maternal samples). The patient sample can be any type of biological sample as described elsewhere herein. Preferably, the sample is obtained by non-invasive procedures. For example, the sample can be a blood sample or the serum and plasma fractions thereof. Alternatively, the sample can be a urine sample or a fecal sample. In other embodiments, the sample is a tissue biopsy sample. In all cases, the sample comprises nucleic acids, for example, cfDNA or genomic DNA, which are purified and sequenced using any of the NGS sequencing methods described previously.
[00467] Both complete and partial chromosomal aneuploidies associated with the formation and progression of cancer can be determined according to the present method.
[00468] In various embodiments, when the methods described here are used to determine the presence and/or increased risk of cancer, data normalization can be performed with respect to the chromosome(s) for which the CNV is determined. In certain embodiments, data normalization can be performed with respect to the chromosome arm(s) for which the CNV is determined. In certain embodiments, data normalization can be performed with respect to the particular segment(s) for which the CNV is determined.
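One possible reading of normalization "with respect to" a chromosome, arm or segment is a ratio of the coverage of the sequence of interest to the coverage of normalizing sequences chosen for that sequence. The sketch below assumes this ratio form; the function name, the region labels and the choice of normalizing regions are all hypothetical and are not taken from this section.

```python
def normalized_coverage(coverage, target, normalizers):
    """Coverage of the target region divided by summed normalizer coverage.

    coverage: dict mapping a region label (chromosome, arm or segment)
              to its read count.
    target: label of the region for which the CNV is being determined.
    normalizers: labels of regions used as the denominator (assumed choice).
    """
    denominator = sum(coverage[name] for name in normalizers)
    if denominator == 0:
        raise ValueError("normalizing regions have zero coverage")
    return coverage[target] / denominator


# Toy read counts; the same function applies to arms or segments by
# keying the dictionary with arm or segment labels instead.
counts = {"chr8": 6000, "chr2": 10000, "chr3": 10000}
print(normalized_coverage(counts, "chr8", ["chr2", "chr3"]))  # 0.3
```

The resulting value would be compared with the corresponding value from unaffected reference samples; how that comparison is made is outside this sketch.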
[00469] In addition to the role of CNV in cancer, CNVs have been associated with a growing number of common complex diseases, including human immunodeficiency virus (HIV) infection, autoimmune diseases and a spectrum of neuropsychiatric disorders.
CNVs in infectious and autoimmune disease
[00470] To date, a number of studies have reported associations between CNV in genes involved in inflammation and the immune response and HIV, asthma, Crohn's disease and other autoimmune disorders (Fanciulli et al., Clin Genet 77: 201-213 [2010]). For example, CNV in CCL3L1 has been implicated in susceptibility to HIV/AIDS (CCL3L1, 17q11.2 deletion), rheumatoid arthritis (CCL3L1, 17q11.2 deletion) and Kawasaki disease (CCL3L1, 17q11.2 duplication); CNV in HBD-2 has been reported to predispose to colonic Crohn's disease (HBD-2, 8p23.1 deletion) and psoriasis (HBD-2, 8p23.1 deletion); and CNV in FCGR3B has been shown to predispose to glomerulonephritis in systemic lupus erythematosus (SLE) (FCGR3B, 1q23 deletion, 1q23 duplication) and to anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (FCGR3B, 1q23 deletion), and to increase the risk of developing rheumatoid arthritis. At least two inflammatory or autoimmune diseases have been shown to be associated with CNV at different gene loci. For example, Crohn's disease is associated with low copy number at HBD-2, but also with a common deletion polymorphism upstream of the IRGM gene, which encodes a member of the p47 immunity-related GTPase family. In addition to the association with FCGR3B copy number, susceptibility to SLE has also been reported to be significantly increased among subjects with a lower copy number of complement component C4.
[00471] Associations between genomic deletions at the GSTM1 (GSTM1, 1q23 deletion) and GSTT1 (GSTT1, 22q11.2 deletion) loci and increased risk of atopic asthma have been reported in several independent studies. In some embodiments, the methods described here can be used to determine the presence or absence of a CNV associated with inflammation and/or autoimmune diseases. For example, the methods can be used to determine the presence of a CNV in a patient suspected of suffering from HIV, asthma or Crohn's disease. Examples of CNV associated with such diseases include, without limitation, deletions at 17q11.2, 8p23.1, 1q23 and 22q11.2 and duplications at 17q11.2 and 1q23. In some embodiments, the present method can be used to determine the presence of CNV in genes including, but not limited to, CCL3L1, HBD-2, FCGR3B, GSTM1, GSTT1, C4 and IRGM.
CNV and diseases of the nervous system
[00472] Associations between de novo and inherited CNV and several common neurological and psychiatric diseases have been reported in autism, schizophrenia and epilepsy, and in some cases of neurodegenerative diseases such as Parkinson's disease, amyotrophic lateral sclerosis (ALS) and autosomal dominant Alzheimer's disease (Fanciulli et al., Clin Genet 77: 201-213 [2010]). Cytogenetic abnormalities have been observed in patients with autism and autism spectrum disorders (ASDs) with duplications at 15q11-q13. According to the Autism Genome Project Consortium, 154 CNVs, including several recurrent CNVs, were found on chromosome 15q11-q13 or at new genomic sites including chromosome 2p16, 1q21 and 17p12, in a region associated with Smith-Magenis syndrome that overlaps with ASD. Recurrent microdeletions or microduplications on chromosome 16p11.2 have highlighted the observation that de novo CNVs are detected at loci for genes such as SHANK3 (22q13.3 deletion), neurexin 1 (NRXN1, 2p16.3 deletion) and the neuroligins (NLGN4, Xp22.33 deletion), which are known to regulate synaptic differentiation and glutamatergic neurotransmitter release. Schizophrenia has also been associated with multiple de novo CNVs. Microdeletions and microduplications associated with schizophrenia contain an overrepresentation of genes belonging to the neurodevelopmental and glutamatergic pathways, suggesting that multiple CNVs affecting these genes may contribute directly to the pathogenesis of schizophrenia, for example, ERBB4, 2q34 deletion; SLC1A3, 5p13.3 deletion; RAPGEF4, 2q31.1 deletion; CIT, 12.24 deletion; and multiple genes with de novo CNV. CNVs have also been associated with other neurological disorders including epilepsy (CHRNA7, 15q13.3 deletion), Parkinson's disease (SNCA, 4q22 duplication) and ALS (SMN1, 5q12.2-q13.3 deletion; and SMN2 deletion). 
In some embodiments, the methods described here can be used to determine the presence or absence of a CNV associated with diseases of the nervous system. For example, the methods can be used to determine the presence of a CNV in a patient suspected of suffering from autism, schizophrenia, epilepsy, or neurodegenerative diseases such as Parkinson's disease, amyotrophic lateral sclerosis (ALS) or autosomal dominant Alzheimer's disease. The methods can be used to determine CNV of genes associated with diseases of the nervous system including, without limitation, any of the autism spectrum disorders (ASD), schizophrenia and epilepsy, and CNV of genes associated with neurodegenerative disorders such as Parkinson's disease. Examples of CNV associated with such diseases include, without limitation, duplications at 15q11-q13, 2p16, 1q21, 17p12, 16p11.2 and 4q22 and deletions at 22q13.3, 2p16.3, Xp22.33, 2q34, 5p13.3, 2q31.1, 12.24, 15q13.3 and 5q12.2. In some embodiments, the methods can be used to determine the presence of CNV in genes including, but not limited to, SHANK3, NLGN4, NRXN1, ERBB4, SLC1A3, RAPGEF4, CIT, CHRNA7, SNCA, SMN1 and SMN2.
CNV and metabolic or cardiovascular diseases
[00473] Associations between CNVs and metabolic and cardiovascular traits, such as familial hypercholesterolemia (FH), atherosclerosis and coronary artery disease, have been reported in several studies (Fanciulli et al., Clin Genet 77: 201-213 [2010]). For example, germline rearrangements, mainly deletions, have been observed in the LDLR gene (LDLR, 19p13.2 deletion/duplication) in some FH patients who carry no other LDLR mutations. Another example is the LPA gene, which encodes apolipoprotein(a) (apo(a)), whose plasma concentration is associated with the risk of coronary artery disease, myocardial infarction (MI) and stroke. 
Plasma concentrations of the apo(a)-containing lipoprotein Lp(a) vary more than 1000-fold between individuals, and 90% of this variability is genetically determined at the LPA locus, with plasma concentration and Lp(a) isoform size being proportional to a highly variable number of 'kringle 4' repeat sequences (range 5 to 50). These data indicate that CNV in at least two genes can be associated with cardiovascular risk. The methods described here can be used in large studies to search specifically for associations of CNV with cardiovascular disorders. In some embodiments, the present method can be used to determine the presence or absence of a CNV associated with metabolic or cardiovascular disease. For example, the present method can be used to determine the presence of a CNV in a patient suspected of suffering from familial hypercholesterolemia. The methods described here can be used to determine CNV of genes associated with metabolic or cardiovascular disease, for example, hypercholesterolemia. Examples of CNV associated with such diseases include, without limitation, 19p13.2 deletion/duplication of the LDLR gene and multiplications in the LPA gene.
Apparatus and systems for determining CNV
[00474] Analysis of the sequencing data and the diagnosis derived from it are typically performed using various algorithms and programs executed on a computer. Therefore, certain embodiments employ processes involving data stored in or transferred through one or more computer systems or other processing systems. The embodiments described here also relate to apparatus for performing these operations. This apparatus may be specially constructed for the required purposes, or it may be a general purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computer. In some embodiments, a group of processors performs some or all of the recited analytical operations collaboratively (for example, via network or cloud computing) and/or in parallel. A processor or group of processors for performing the methods described here may be of various types, including microcontrollers and microprocessors such as programmable devices (for example, CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general purpose microprocessors.
[00475] In addition, certain embodiments relate to tangible and/or non-transitory computer-readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives and magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory (ROM) and random access memory (RAM) devices. Computer-readable media can be directly controlled by an end user, or the media can be indirectly controlled by the end user. Examples of directly controlled media include media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that are indirectly accessible to the user via an external network and/or via a service providing shared resources such as the “cloud”. Examples of program instructions include both machine code, as produced by a compiler, and files containing higher-level code that can be executed by the computer using an interpreter.
[00476] In various embodiments, the data or information used in the methods and apparatus described are provided in an electronic format. Such data or information may include readings and labels derived from a nucleic acid sample, counts or densities of such labels that align with particular regions of a reference sequence (for example, that align with a chromosome or chromosome segment), reference sequences (including reference sequences that provide solely or primarily polymorphisms), chromosome and segment doses, determinations such as aneuploidy determinations, normalized chromosome and segment values, pairs of chromosomes or segments and corresponding normalization chromosomes or segments, counseling recommendations, diagnoses and the like. As used here, data or other information provided in electronic format is available for storage on a machine and transmission between machines. Conventionally, data in electronic format is provided digitally and can be stored as bits and/or bytes in various data structures, lists, databases, etc. The data can be embodied electronically, optically, etc.
[00477] One embodiment provides a computer program product for generating an output indicating the presence or absence of an aneuploidy, for example, a fetal aneuploidy or cancer, in a test sample. The computer product may contain instructions for performing any one or more of the methods described above for determining a chromosomal abnormality. As explained, the computer product can include a non-transitory and/or tangible computer-readable medium having computer-executable or compilable logic (for example, instructions) recorded thereon to enable a processor to determine chromosome doses and, in some cases, whether a fetal aneuploidy is present or absent. In one example, the computer product comprises a computer-readable medium having computer-executable or compilable logic (for example, instructions) recorded thereon to enable a processor to diagnose a fetal aneuploidy, comprising: a receiving procedure for receiving sequencing data from at least a portion of the nucleic acid molecules of a maternal biological sample, wherein said sequencing data comprises a calculated chromosome and/or segment dose; computer-assisted logic for analyzing a fetal aneuploidy from said received data; and an output procedure for generating an output indicating the presence, absence or type of said fetal aneuploidy.
[00478] The sequence information of the sample under consideration can be mapped to chromosome reference sequences to identify a number of sequence labels for each of any one or more chromosomes of interest and to identify a number of sequence labels for a normalization segment sequence for each of said one or more chromosomes of interest. In various embodiments, the reference sequences are stored in a database such as a relational or object database, for example.
[00479] It should be understood that it is not practical, or even possible in most cases, for an unaided human being to perform the computational operations of the methods described here. For example, mapping a single 30 base pair reading from a sample to any one of the human chromosomes could take years of effort without the assistance of a computational apparatus. The problem is compounded because reliable aneuploidy determinations usually require mapping thousands (for example, at least about 10,000) or even millions of readings to one or more chromosomes.
[00480] The methods described here can be performed using a system for evaluating the copy number of a genetic sequence of interest in a test sample. The system comprises: (a) a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the sample; (b) a processor; and (c) one or more computer-readable storage media having stored thereon instructions for execution on said processor to perform a method for identifying any CNV, for example, chromosomal or partial aneuploidies.
[00481] In some embodiments, the methods are directed by a computer-readable medium having stored thereon computer-readable instructions for performing a method for identifying any CNV, for example, chromosomal or partial aneuploidies. Thus, one embodiment provides a computer program product comprising one or more non-transitory computer-readable storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for evaluating the copy number of a sequence of interest in a test sample comprising fetal and maternal cell-free nucleic acids. The method includes: (a) receiving sequence readings obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence readings of the cell-free nucleic acid fragments to a reference genome comprising the sequence of interest, thereby providing test sequence labels, wherein the reference genome is divided into a plurality of bins; (c) determining the sizes of the cell-free nucleic acid fragments in the test sample; (d) weighting the test sequence labels based on the sizes of the cell-free nucleic acid fragments from which the labels are obtained; (e) calculating coverages for the bins based on the weighted labels of (d); and (f) identifying a copy number variation in the sequence of interest from the calculated coverages. In some implementations, weighting the test sequence labels involves biasing the coverages toward test sequence labels obtained from cell-free nucleic acid fragments of a size or size range characteristic of one genome in the test sample. In some implementations, weighting the test sequence labels involves assigning a value of 1 to labels obtained from cell-free nucleic acid fragments of the size or size range and assigning a value of 0 to other labels. 
In some implementations, the method further involves determining, in bins of the reference genome, including the sequence of interest, values of a fragment size parameter including an amount of the cell-free nucleic acid fragments in the test sample having fragment sizes shorter or longer than a threshold value. Here, identifying the variation in the number of copies in the sequence of interest involves using the values of the fragment size parameter as well as the coverage calculated in (e). In some implementations, the system is configured to evaluate the number of copies in the test sample using the various methods and processes discussed above.
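Steps (c) to (e) and the fragment size parameter can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Fragment` record, the function names, the fixed-width binning, and the choice of a per-bin fraction as the "amount" of short fragments are all assumptions made for illustration. The binary weighting follows the 1/0 scheme described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record for one aligned cell-free fragment."""
    chrom: str   # chromosome the fragment aligned to
    start: int   # 0-based alignment start on the reference
    size: int    # fragment length in base pairs


def weighted_bin_coverage(fragments, bin_size, size_range):
    """Weight each label 1 if its fragment size lies in size_range, else 0,
    and sum the weights per (chromosome, bin index)."""
    low, high = size_range
    coverage = defaultdict(float)
    for frag in fragments:
        weight = 1.0 if low <= frag.size <= high else 0.0
        coverage[(frag.chrom, frag.start // bin_size)] += weight
    return dict(coverage)


def short_fragment_fraction(fragments, bin_size, threshold):
    """Per-bin fraction of fragments shorter than threshold (one possible
    fragment size parameter; the fraction form is an assumption)."""
    short = defaultdict(int)
    total = defaultdict(int)
    for frag in fragments:
        key = (frag.chrom, frag.start // bin_size)
        total[key] += 1
        if frag.size < threshold:
            short[key] += 1
    return {key: short[key] / total[key] for key in total}
```

Both functions return dictionaries keyed by `(chromosome, bin index)`, so the coverage and the size parameter for a bin can be looked up together when a downstream step combines them.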
[00482] In some embodiments, the instructions may also include automatically recording information relevant to the method, such as chromosome doses and the presence or absence of a fetal chromosomal aneuploidy, in a patient medical record for the human subject who provided the maternal test sample. The patient medical record can be maintained by, for example, a laboratory, doctor's office, hospital, health care organization, insurance company or personal medical record website. In addition, based on the results of the processor-implemented analysis, the method may also involve prescribing, initiating, and/or changing the treatment of the human subject from whom the maternal test sample was taken. This may involve performing one or more additional tests or analyses on additional samples taken from the subject.
[00483] The described methods can also be performed using a computer processing system that is adapted or configured to perform a method for identifying any CNV, for example, chromosomal or partial aneuploidies. One embodiment provides a computer processing system that is adapted or configured to perform a method as described here. In one embodiment, the apparatus comprises a sequencing device adapted or configured to sequence at least a portion of the nucleic acid molecules in a sample to obtain the type of sequence information described elsewhere herein. The apparatus may also include components for processing the sample. Such components are described elsewhere herein.
[00484] Sequence data or other data can be input into a computer or stored on a computer-readable medium, either directly or indirectly. In one embodiment, a computer system is directly coupled to a sequencing device that reads and/or analyzes nucleic acid sequences from the samples. Sequences or other information from such tools are provided to the computer system via an interface. Alternatively, the sequences processed by the system are provided from a sequence storage source such as a database or other repository. Once available to the processing apparatus, a memory device or mass storage device buffers or stores, at least temporarily, the nucleic acid sequences. In addition, the memory device can store label counts for various chromosomes or genomes, etc. The memory can also store various routines and/or programs for analyzing and presenting the sequence or mapped data. Such programs/routines may include programs for performing statistical analyses, etc.
[00485] In one example, a user provides a sample to a sequencing device. Data is collected and/or analyzed by the sequencing device, which is connected to a computer. Software on the computer enables the data collection and/or analysis. Data can be stored, displayed (via a monitor or similar device), and/or sent to another location. The computer can be connected to the internet, which is used to transmit the data to a handheld device used by a remote user (for example, a doctor, scientist or analyst). It is understood that the data can be stored and/or analyzed before transfer. In some embodiments, raw data is collected and sent to a remote user or device that will analyze and/or store the data. The transfer can take place via the internet, but it can also take place via satellite or another connection. Alternatively, the data can be stored on a computer-readable medium and the medium can be sent to an end user (for example, by mail). The remote user can be in the same or a different geographic location including, but not limited to, a building, city, state, country or continent.
[00486] In some embodiments, the methods also include collecting data regarding a plurality of polynucleotide sequences (for example, readings, labels and/or reference chromosome sequences) and sending the data to a computer or other computational system. For example, the computer can be connected to laboratory equipment, for example, a sample collection device, a nucleotide amplification device, a nucleotide sequencing device or a hybridization device. The computer can then collect applicable data gathered by the laboratory device. The data can be stored on a computer at any stage, for example, while being collected in real time, before sending, during or in conjunction with sending, or after sending. The data can be stored on a computer-readable medium that can be extracted from the computer. The collected or stored data can be transmitted from the computer to a remote location, for example, via a local network or a wide area network such as the internet. At the remote location, various operations can be performed on the transmitted data as described below.
[00487] Among the types of electronically formatted data that can be stored, transmitted, analyzed, and/or manipulated in the systems, apparatus and methods described here are the following:
Readings obtained by sequencing nucleic acids in a test sample
Labels obtained by aligning readings to a reference genome or other reference sequence or sequences
The reference genome or sequence
Sequence label density: counts or numbers of labels for each of two or more regions (typically chromosomes or chromosomal segments) of a reference genome or other reference sequences
Identities of normalization chromosomes or chromosomal segments for particular chromosomes or chromosomal segments of interest
Doses for chromosomes or chromosomal segments (or other regions) obtained from chromosomes or segments of interest and corresponding normalization chromosomes or segments
Thresholds for determining chromosome doses as affected, unaffected or undetermined
Actual chromosome dose determinations
Diagnoses (clinical condition associated with the determinations)
Recommendations for further testing derived from the determinations and/or diagnoses
Treatment and/or monitoring plans derived from the determinations and/or diagnoses
[00488] These various types of data can be obtained, stored, transmitted, analyzed, and/or manipulated at one or more locations using distinct apparatus. The processing options span a wide spectrum. At one end of the spectrum, all or much of this information is stored and used at the location where the test sample is processed, for example, a doctor's office or other clinical setting. At the other extreme, the sample is obtained at one location, it is processed and optionally sequenced at a different location, readings are aligned and determinations are made at one or more different locations, and diagnoses, recommendations, and/or plans are prepared at still another location (which can be the location where the sample was obtained).
[00489] In various embodiments, the readings are generated with the sequencing apparatus and then transmitted to a remote site where they are processed to produce aneuploidy determinations. At this remote location, as an example, the readings are aligned to a reference sequence to produce labels, which are counted and assigned to chromosomes or segments of interest. Also at the remote location, the counts are converted to doses using associated normalization chromosomes or segments. Furthermore, at the remote location, the doses are used to generate aneuploidy determinations.
[00490] Among the processing operations that can be employed at distinct locations are the following:
Sample collection
Sample processing preliminary to sequencing
Sequencing
Analysis of sequence data and derivation of aneuploidy determinations
Diagnosis
Reporting a diagnosis and/or a determination to the patient or healthcare provider
Developing a plan for further treatment, testing, and/or monitoring
Implementing the plan
Counseling
[00491] Any one or more of these operations can be automated as described elsewhere herein. Typically, the sequencing and the analysis of sequence data and derivation of aneuploidy determinations will be performed computationally. The other operations can be performed manually or automatically.
[00492] Examples of locations where sample collection can be performed include health professionals' offices, clinics, patients' homes (where a sample collection tool or kit is provided) and mobile health care vehicles. Examples of locations where sample processing prior to sequencing can be performed include health professionals' offices, clinics, patients' homes (where a sample processing device or kit is provided), mobile health care vehicles and facilities of aneuploidy analysis providers. Examples of locations where sequencing can be performed include health professionals' offices, clinics, patients' homes (where a sample sequencing device and/or kit is provided), mobile health care vehicles and facilities of aneuploidy analysis providers. The location where the sequencing takes place can be provided with a dedicated network connection for transmitting the sequence data (typically readings) in an electronic format. Such a connection can be wired or wireless and can be configured to send the data to a site where the data can be processed and/or aggregated before transmission to a processing site. 
Data aggregators can be maintained by health organizations such as Health Maintenance Organizations (HMOs).
[00493] The analysis and/or derivation operations can be performed at any of the preceding locations or alternatively at another remote site dedicated to computation and/or to the service of analyzing nucleic acid sequence data. Such locations include, for example, clusters such as general purpose server farms, the facilities of an aneuploidy analysis service business and the like. In some embodiments, the computational apparatus used to perform the analysis is leased or rented. The computational resources can be part of an internet-accessible collection of processors such as the processing resources colloquially known as the cloud. In some cases, the computations are performed by a parallel or massively parallel group of processors that are affiliated or unaffiliated with one another. The processing can be accomplished using distributed processing such as cluster computing, grid computing and the like. In such embodiments, a cluster or grid of computational resources collectively forms a virtual supercomputer composed of multiple processors or computers acting together to perform the analysis and/or derivation described here. These technologies, as well as more conventional supercomputers, can be employed to process sequence data as described here. Each is a form of parallel computing that relies on multiple processors or computers. In the case of grid computing, these processors (often whole computers) are connected over a network (private, public or the internet) by a conventional network protocol such as Ethernet. In contrast, a supercomputer has many processors connected by a local high-speed computer bus.
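As a rough illustration of how label counting might be split across workers and merged, the following Python sketch divides a list of per-label chromosome names into chunks, counts each chunk independently, and combines the partial counts. The function names, the chunking scheme and the use of a thread pool are assumptions for illustration only; in a cluster or grid deployment each chunk would go to a separate processor or machine rather than a local thread.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Count labels per chromosome within one chunk of work."""
    return Counter(chunk)


def parallel_label_counts(label_chroms, n_workers=4, chunk_size=1000):
    """Split the labels into chunks, count each chunk on a worker,
    and merge the partial counts into one Counter."""
    chunks = [
        label_chroms[i:i + chunk_size]
        for i in range(0, len(label_chroms), chunk_size)
    ]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

Because per-chromosome counts are additive, the merge step is independent of how the work was partitioned, which is what makes this kind of analysis straightforward to distribute.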
[00494] In certain embodiments, the diagnosis (for example, the fetus has Down syndrome or the patient has a particular type of cancer) is generated at the same location as the analysis operation. In other embodiments, it is generated at a different location. In some instances, reporting of the diagnosis is performed at the location where the sample was taken, although this need not be the case. Examples of locations where the diagnosis can be generated or reported and/or where the development of a plan is carried out include health professionals' offices, clinics, websites accessible by computers, and portable devices such as cell phones, tablets, smartphones, etc. having a wired or wireless connection to a network. Examples of locations where counseling is provided include health professionals' offices, clinics, websites accessible by computers, portable devices, etc.
[00495] In some embodiments, the sample collection, sample processing and sequencing operations are performed at a first location and the analysis and derivation operation is performed at a second location. However, in some cases, the sample is collected at one location (for example, a healthcare professional's office or clinic) and the sample processing and sequencing are performed at a different location, which is optionally the same location where the analysis and derivation occur.
[00496] In various embodiments, a sequence of the operations listed above can be triggered by a user or entity that initiates the sample collection, sample processing and/or sequencing. After one or more of these operations have begun, the other operations can follow automatically. For example, the sequencing operation can cause the readings to be automatically collected and sent to a processing apparatus that then conducts, often automatically and possibly without further user intervention, the sequence analysis and aneuploidy derivation operation. In some implementations, the result of this processing operation is then automatically delivered, possibly with reformatting as a diagnosis, to a system component or entity that processes and reports the information to a healthcare professional and/or patient. As explained, such information can also be automatically processed to produce a treatment, testing, and/or monitoring plan, possibly together with counseling information. Thus, initiating an early stage operation can trigger an end-to-end sequence in which the healthcare professional, patient, or other interested party is provided with a diagnosis, plan, counseling, and/or other information useful for acting on a physical condition. This is accomplished even though parts of the overall system are physically separated and possibly remote from the location of, for example, the sample and sequencing apparatus.
[00497] Figure 5 shows one implementation of a distributed system for producing a determination or diagnosis from a test sample. A sample collection site 01 is used to obtain a test sample from a patient such as a pregnant woman or a putative cancer patient. The sample is then provided to a processing and sequencing site 03, where the test sample can be processed and sequenced as described above. Site 03 includes apparatus for processing the sample as well as apparatus for sequencing the processed sample. The result of the sequencing, as described elsewhere herein, is a collection of readings that are typically provided in an electronic format and delivered to a network such as the Internet, which is indicated by reference number 05 in Figure 5.
[00498] The sequence data is provided to a remote location 07 where the analysis and determination generation are performed. This location can include one or more powerful computational devices such as computers or processors. After the computational resources at site 07 have completed their analysis and generated a determination from the received sequence information, the determination is relayed back to network 05. In some implementations, not only a determination but also an associated diagnosis is generated at site 07. The determination and/or diagnosis is then transmitted over the network and back to the sample collection site 01, as illustrated in Figure 5. As explained, this is simply one of many variations on how the various operations associated with generating a determination or diagnosis can be divided among several locations. One common variant involves providing sample collection and processing and sequencing at a single location. Another variation involves providing processing and sequencing at the same location as analysis and determination generation.
[00499] Figure 6 elaborates on the options to perform various operations in different locations. In the most granular sense represented in Figure 6, each of the following operations is performed in a separate location: sample collection, sample processing, sequencing, reading alignment, determination, diagnosis and report and / or plan development.
[00500] In one embodiment that aggregates some of these operations, sample processing and sequencing are performed at one location and reading alignment, determination and diagnosis are performed at a separate location. See the portion of Figure 6 identified by the reference character A. In another implementation, which is identified by the character B in Figure 6, sample collection, sample processing and sequencing are all performed at the same location. In this implementation, reading alignment and determination are performed at a second location. Finally, diagnosis and reporting and/or plan development are performed at a third location. In the implementation represented by the character C in Figure 6, sample collection is performed at a first location; sample processing, sequencing, reading alignment, determination and diagnosis are all performed together at a second location; and reporting and/or plan development are performed at a third location. Finally, in the implementation labeled D in Figure 6, sample collection is performed at a first location; sample processing, sequencing, reading alignment and determination are all performed at a second location; and diagnosis and reporting and/or plan development are performed at a third location.
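The four groupings described for Figure 6 can be written down as a small configuration table, which makes it easy to check where a given operation runs in each variant. The sketch below is only an illustration: the dictionary layout, operation names and helper functions are assumptions, and variant A lists only the operations the text explicitly places.

```python
# Operations named in the description of Figure 6.
OPERATIONS = (
    "sample collection", "sample processing", "sequencing",
    "reading alignment", "determination", "diagnosis",
    "report and/or plan development",
)

# Each variant is a list of locations; each location is the set of
# operations performed together there (index 0 = first location).
CONFIGURATIONS = {
    "A": [
        {"sample processing", "sequencing"},
        {"reading alignment", "determination", "diagnosis"},
    ],
    "B": [
        {"sample collection", "sample processing", "sequencing"},
        {"reading alignment", "determination"},
        {"diagnosis", "report and/or plan development"},
    ],
    "C": [
        {"sample collection"},
        {"sample processing", "sequencing", "reading alignment",
         "determination", "diagnosis"},
        {"report and/or plan development"},
    ],
    "D": [
        {"sample collection"},
        {"sample processing", "sequencing", "reading alignment",
         "determination"},
        {"diagnosis", "report and/or plan development"},
    ],
}


def location_of(config, operation):
    """Return the index of the location performing the operation,
    or None if the description does not place it for this variant."""
    for index, ops in enumerate(CONFIGURATIONS[config]):
        if operation in ops:
            return index
    return None


def is_consistent(config):
    """Check that every listed operation is known and assigned only once."""
    seen = set()
    for ops in CONFIGURATIONS[config]:
        if not ops <= set(OPERATIONS) or seen & ops:
            return False
        seen |= ops
    return True
```

For example, `location_of("C", "diagnosis")` returns the second location (index 1), whereas in variants B and D diagnosis falls at the third location (index 2).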
[00501] One embodiment provides a system for use in determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids, the system including: a sequencer for receiving a nucleic acid sample and providing fetal and maternal nucleic acid sequence information from the sample; a processor; and a machine-readable storage medium comprising instructions for execution on said processor, the instructions comprising: (a) code for obtaining sequence information for said fetal and maternal nucleic acids in the sample; (b) code for using said sequence information to computationally identify a number of sequence labels from the fetal and maternal nucleic acids for each of any one or more chromosomes of interest selected from chromosomes 1-22, X and Y and to identify a number of sequence labels for at least one normalization chromosome sequence or normalization chromosome segment sequence for each of said one or more chromosomes of interest; (c) code for using said number of sequence labels identified for each of said any one or more chromosomes of interest and said number of sequence labels identified for each normalization chromosome sequence or normalization chromosome segment sequence to calculate a single chromosome dose for each of the one or more chromosomes of interest; and (d) code for comparing each of the single chromosome doses for each of the one or more chromosomes of interest to a corresponding threshold value for each of the one or more chromosomes of interest and thereby determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in the sample.
[00502] In some embodiments, the code for calculating a single chromosome dose for each of the one or more chromosomes of interest comprises code for calculating a chromosome dose for a selected one of the chromosomes of interest as the ratio of the number of sequence labels identified for the selected chromosome of interest to the number of sequence labels identified for the corresponding at least one normalization chromosome sequence or normalization chromosome segment sequence for the selected chromosome of interest.
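The dose calculation of paragraph [00502] can be illustrated with a minimal sketch. The function name, the dictionary layout and the example counts below are hypothetical, and summing the counts over a group of normalizing chromosomes is an assumption of this sketch rather than something the text prescribes.

```python
def chromosome_dose(label_counts, chromosome_of_interest, normalizing_chromosomes):
    """Ratio of the label count for one chromosome of interest to the
    summed label count of its normalizing chromosome(s)."""
    numerator = label_counts[chromosome_of_interest]
    # Assumption: a group of normalizing chromosomes contributes its summed count.
    denominator = sum(label_counts[c] for c in normalizing_chromosomes)
    if denominator == 0:
        raise ValueError("normalizing chromosomes have no sequence labels")
    return numerator / denominator


# Hypothetical counts, for illustration only.
example_counts = {"chr21": 150, "chr9": 1000, "chr11": 500}
example_dose = chromosome_dose(example_counts, "chr21", ["chr9", "chr11"])
```

In a system of the kind described in [00501], a dose computed this way would then be compared with a threshold specific to the chromosome of interest.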
[00503] In some embodiments, the system also includes code to repeat the calculation of a chromosome dose for each of any remaining chromosomal segments of the one or more segments of any one or more chromosomes of interest.
[00504] In some embodiments, the one or more chromosomes of interest selected from chromosomes 1-22, X and Y comprise at least twenty chromosomes selected from chromosomes 1-22, X and Y, and the instructions comprise instructions for determining the presence or absence of at least twenty different complete fetal chromosomal aneuploidies.
[00505] In some embodiments, the at least one normalization chromosome sequence is a group of chromosomes selected from chromosomes 1-22, X and Y. In other embodiments, the at least one normalization chromosome sequence is a single chromosome selected from chromosomes 1-22, X and Y.
[00506] Another embodiment provides a system for use in determining the presence or absence of any one or more different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids, the system comprising: a sequencer to receive a nucleic acid sample and provide fetal and maternal nucleic acid sequence information from the sample; a processor; and a machine-readable storage medium comprising instructions for execution on said processor, the instructions comprising: (a) code for obtaining sequence information for said fetal and maternal nucleic acids in said sample; (b) code for using said sequence information to computationally identify a number of sequence labels from the fetal and maternal nucleic acids for each of any one or more segments of any one or more chromosomes of interest selected from chromosomes 1-22, X and Y and to identify a number of sequence labels for at least one normalization segment sequence for each of said one or more segments of any one or more chromosomes of interest; (c) code for using said number of sequence labels identified for each of said any one or more segments of any one or more chromosomes of interest and said number of sequence labels identified for said normalization segment sequence to calculate a single chromosome segment dose for each of said one or more segments of any one or more chromosomes of interest; and (d) code for comparing each of said single chromosome segment doses for each of said one or more segments of any one or more chromosomes of interest to a corresponding threshold value for each of said any one or more chromosome segments of any one or more chromosomes of interest and thereby determining the presence or absence of one or more different partial fetal chromosomal aneuploidies in said sample.
[00507] In some embodiments, the code for calculating a single chromosome segment dose comprises code for calculating a chromosome segment dose for a selected one of the chromosome segments as the ratio of the number of sequence labels identified for the selected chromosome segment to the number of sequence labels identified for a corresponding normalization segment sequence for the selected chromosome segment.
[00508] In some embodiments, the system also comprises code to repeat the calculation of a chromosome segment dose for each of any remaining chromosome segments of the one or more segments of any one or more chromosomes of interest.
[00509] In some embodiments, the system also comprises (i) code to repeat (a) to (d) for test samples from different maternal subjects and (ii) code to determine the presence or absence of any one or more different partial fetal chromosomal aneuploidies in each of said samples.
[00510] In other embodiments of any of the systems provided herein, the code further comprises code to automatically record the presence or absence of a fetal chromosomal aneuploidy as determined in (d) in a patient medical record for a human subject providing the maternal test sample, where the recording is performed using the processor.
[00511] In some embodiments of any of the systems provided herein, the sequencer is configured to perform next generation sequencing (NGS). In some embodiments, the sequencer is configured to perform massively parallel sequencing using sequencing-by-synthesis with reversible dye terminators. In other embodiments, the sequencer is configured to perform sequencing-by-ligation. In yet other embodiments, the sequencer is configured to perform single molecule sequencing.
EXPERIMENTAL PART
Example 1
Preparation and sequencing of primary and enriched sequencing libraries
a. Preparation of sequencing libraries - abbreviated protocol (ABB)
[00512] All sequencing libraries, that is, primary and enriched libraries, were prepared from approximately 2 ng of purified cfDNA extracted from maternal plasma. Library preparation was performed using reagents of the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No E6000L; New England Biolabs, Ipswich, MA) for Illumina® as follows. Because cell-free plasma DNA is fragmented in nature, no further fragmentation by nebulization or sonication was done on the plasma DNA samples. The overhangs of approximately 2 ng of purified cfDNA fragments contained in 40 µl were converted into phosphorylated blunt ends according to the NEBNext® End Repair Module by incubating the cfDNA in a 1.5 ml microcentrifuge tube with 5 µl of 10X phosphorylation buffer, 2 µl of deoxynucleotide solution mix (10 mM each dNTP), 1 µl of a 1:5 dilution of DNA Polymerase I, 1 µl of T4 DNA Polymerase and 1 µl of T4 Polynucleotide Kinase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1 for 15 minutes at 20 °C. The enzymes were then heat inactivated by incubating the reaction mixture at 75 °C for 5 minutes. The mixture was cooled to 4 °C and dA-tailing of the blunt-ended DNA was performed using 10 µl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo minus) (NEBNext™ DNA Sample Prep DNA Reagent Set 1) and incubating for 15 minutes at 37 °C. Subsequently, the Klenow fragment was heat inactivated by incubating the reaction mixture at 75 °C for 5 minutes.
Following inactivation of the Klenow fragment, 1 µl of a 1:5 dilution of Illumina Genomic Adapter Oligo Mix (Part No 1000521; Illumina Inc., Hayward, CA) was used to ligate the Illumina adapters (Non-Index Y-Adapters) to the dA-tailed DNA using 4 µl of the T4 DNA ligase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, by incubating the reaction mixture for 15 minutes at 25 °C. The mixture was cooled to 4 °C and the adapter-ligated cfDNA was purified from unligated adapters, adapter dimers and other reagents using magnetic beads provided in the Agencourt AMPure XP PCR purification system (Part No A63881; Beckman Coulter Genomics, Danvers, MA). Eighteen cycles of PCR were performed to selectively enrich the adapter-ligated cfDNA (25 µl) using Phusion® High-Fidelity Master Mix (25 µl; Finnzymes, Woburn, MA) and Illumina PCR primers (0.5 µM each) complementary to the adapters (Part No. 1000537 and 1000537). The adapter-ligated DNA was subjected to PCR (98 °C for 30 seconds; 18 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds; final extension at 72 °C for 5 minutes and hold at 4 °C) using Illumina Genomic PCR Primers (Part No. 100537 and 1000538) and the Phusion HF PCR Master Mix provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, according to the manufacturer's instructions. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, MA) according to the manufacturer's instructions available at www.beckmangenomics.com/products/AMPureXPProtocol_000387v001.pdf. The purified amplified product was eluted in 40 µl of Qiagen Buffer EB and the concentration and size distribution of the amplified libraries were analyzed using the Agilent DNA 1000 Kit for the 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, CA).
b. Preparation of sequencing libraries - full-length protocol
[00513] The full-length protocol described here is essentially the standard protocol provided by Illumina and differs from the Illumina protocol only in the purification of the amplified library. The Illumina protocol instructs that the amplified library be purified using gel electrophoresis, while the protocol described here uses magnetic beads for the same purification step. Approximately 2 ng of purified cfDNA extracted from maternal plasma was used to prepare a primary sequencing library using the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No E6000L; New England Biolabs, Ipswich, MA) for Illumina® essentially according to the manufacturer's instructions. All steps, except for the final purification of the adapter-ligated products, which was performed using Agencourt magnetic beads and reagents instead of the purification column, were performed according to the protocol accompanying the NEBNext™ Reagents for Sample Preparation for a genomic DNA library that is sequenced using the Illumina® GAII. The NEBNext™ protocol essentially follows that provided by Illumina, which is available at grcf.jhml.edu/hts/protocols/11257047_ChIP_Sample_Prep.pdf.
[00514] The overhangs of approximately 2 ng of purified cfDNA fragments contained in 40 µl were converted into phosphorylated blunt ends according to the NEBNext® End Repair Module by incubating the 40 µl of cfDNA with 5 µl of 10X phosphorylation buffer, 2 µl of deoxynucleotide solution mix (10 mM each dNTP), 1 µl of a 1:5 dilution of DNA Polymerase I, 1 µl of T4 DNA Polymerase and 1 µl of T4 Polynucleotide Kinase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1 in a 200 µl microcentrifuge tube in a thermal cycler for 30 minutes at 20 °C. The sample was cooled to 4 °C and purified using a QIAquick column provided in the QIAquick PCR Purification Kit (QIAGEN Inc., Valencia, CA) as follows. The 50 µl reaction was transferred to a 1.5 ml microcentrifuge tube and 250 µl of Qiagen Buffer PB were added. The resulting 300 µl were transferred to a QIAquick column, which was centrifuged at 13,000 RPM for 1 minute in a microcentrifuge. The column was washed with 750 µl of Qiagen Buffer PE and re-centrifuged. Residual ethanol was removed by an additional centrifugation for 5 minutes at 13,000 RPM. The DNA was eluted in 39 µl of Qiagen Buffer EB by centrifugation. dA-tailing of 34 µl of the blunt-ended DNA was performed using 16 µl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo minus) (NEBNext™ DNA Sample Prep DNA Reagent Set 1) and incubating for 30 minutes at 37 °C according to the manufacturer's NEBNext® dA-Tailing Module. The sample was cooled to 4 °C and purified using a column provided in the MinElute PCR Purification Kit (QIAGEN Inc., Valencia, CA) as follows. The 50 µl reaction was transferred to a 1.5 ml microcentrifuge tube and 250 µl of Qiagen Buffer PB were added. The 300 µl were transferred to the MinElute column, which was centrifuged at 13,000 RPM for 1 minute in a microcentrifuge. The column was washed with 750 µl of Qiagen Buffer PE and re-centrifuged.
Residual ethanol was removed by an additional centrifugation for 5 minutes at 13,000 RPM. The DNA was eluted in 15 µl of Qiagen Buffer EB by centrifugation. Ten microliters of the DNA eluate were incubated with 1 µl of a 1:5 dilution of the Illumina Genomic Adapter Oligo Mix (Part No. 1000521), 15 µl of 2X Quick Ligation Reaction Buffer and 4 µl of Quick T4 DNA Ligase for 15 minutes at 25 °C according to the NEBNext® Quick Ligation Module. The sample was cooled to 4 °C and purified using a MinElute column as follows. One hundred and fifty microliters of Qiagen Buffer PE were added to the 30 µl reaction and the entire volume was transferred to a MinElute column, which was centrifuged at 13,000 RPM for 1 minute in a microcentrifuge. The column was washed with 750 µl of Qiagen Buffer PE and re-centrifuged. Residual ethanol was removed by an additional centrifugation for 5 minutes at 13,000 RPM. The DNA was eluted in 28 µl of Qiagen Buffer EB by centrifugation. Twenty-three microliters of the adapter-ligated DNA eluate were subjected to 18 cycles of PCR (98 °C for 30 seconds; 18 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds and 72 °C for 30 seconds; final extension at 72 °C for 5 minutes and hold at 4 °C) using Illumina Genomic PCR Primers (Part No. 100537 and 1000538) and the Phusion HF PCR Master Mix provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, according to the manufacturer's instructions. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, MA) according to the manufacturer's instructions available at www.beckmangenomics.com/products/AMPureXPProtocol_000387v001.pdf. The Agencourt AMPure XP PCR purification system removes unincorporated dNTPs, primers, primer dimers, salts and other contaminants and recovers amplicons greater than 100 base pairs.
The purified amplified product was eluted from the Agencourt beads in 40 µl of Qiagen Buffer EB and the size distribution of the libraries was analyzed using the Agilent DNA 1000 Kit for the 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, CA).
c. Analysis of sequencing libraries prepared according to the abbreviated (a) and full-length (b) protocols
[00515] The electropherograms generated by the Bioanalyzer are shown in Figures 7A and 7B. Figure 7A shows the electropherogram of library DNA prepared from cfDNA purified from plasma sample M24228 using the full-length protocol described in (b), and Figure 7B shows the electropherogram of library DNA prepared from cfDNA purified from plasma sample M24228 using the abbreviated protocol described in (a). In both figures, peaks 1 and 4 represent the 15 base pair Lower Marker and the 1,500 base pair Upper Marker, respectively; the numbers above the peaks indicate the migration times for the library fragments; and the horizontal lines indicate the threshold set for integration. The electropherogram in Figure 7A shows a minor peak of fragments of 187 base pairs and a major peak of fragments of 263 base pairs, while the electropherogram in Figure 7B shows only one peak at 265 base pairs. Integration of the peak areas resulted in a calculated concentration of 0.40 ng/µl for the DNA of the 187 base pair peak in Figure 7A, a concentration of 7.34 ng/µl for the DNA of the 263 base pair peak in Figure 7A and a concentration of 14.72 ng/µl for the DNA of the 265 base pair peak in Figure 7B. The Illumina adapters that were ligated to the cfDNA are known to be 92 base pairs, which, when subtracted from the 265 base pairs, indicates that the peak size of the cfDNA is 173 base pairs. It is possible that the minor peak at 187 base pairs represents fragments of two primers that were ligated end-to-end.
The linear two-primer fragments are eliminated from the final library product when the abbreviated protocol is used. The abbreviated protocol also eliminates other fragments of less than 187 base pairs. In this example, the concentration of purified adapter-ligated cfDNA obtained with the abbreviated protocol is twice that of the adapter-ligated cfDNA produced using the full-length protocol. It was observed that the concentration of the adapter-ligated cfDNA fragments obtained with the abbreviated protocol was always higher than that obtained using the full-length protocol (data not shown).
[00516] Thus, an advantage of preparing the sequencing library using the abbreviated protocol is that the library obtained consistently comprises only one major peak in the range of 262 to 267 base pairs, while the quality of the library prepared using the full-length protocol varies, as reflected by the number and mobility of peaks other than that representing the cfDNA. Non-cfDNA products would occupy space on the flow cell and diminish the quality of the cluster amplification and subsequent imaging of the sequencing reactions, which underlie the overall assignment of aneuploidy status. The abbreviated protocol was shown not to affect the sequencing of the library.
[00517] Another advantage of preparing the sequencing library using the abbreviated protocol is that the three enzymatic steps of blunt-ending, dA-tailing and adapter ligation take less than an hour to complete, which supports the validation and implementation of a rapid aneuploidy diagnostic service.
[00518] Another advantage is that the three enzymatic steps of blunt-ending, dA-tailing and adapter ligation are performed in the same reaction tube, thus avoiding multiple sample transfers that could potentially lead to loss of material and, more importantly, to possible sample mix-up and sample contamination.
Example 2
Non-Invasive Prenatal Testing Using Fragment Size
Introduction
[00519] Since its commercial introduction in late 2011 and early 2012, non-invasive prenatal testing (NIPT) of cell-free DNA (cfDNA) in maternal plasma has quickly become the method of choice for screening pregnant women at high risk for fetal aneuploidies. The methods are essentially based on isolating and sequencing cfDNA in the plasma of pregnant women and counting the number of cfDNA fragments that align with particular regions of the reference human genome (references: Fan et al., Lo et al.).
These methods of DNA sequencing and molecular counting allow a highly accurate determination of the relative copy number for each of the chromosomes across the genome. High sensitivities and specificities for the detection of trisomies 21, 18 and 13 have been reproducibly obtained in multiple clinical studies (refs, cite Gil / Nicolaides meta-analysis).
[00520] More recently, additional clinical studies have shown that this method can be extended to the general obstetric population. There is no detectable difference in fetal fractions between high-risk and average-risk populations (refs). The clinical study results demonstrate that NIPT using molecular counting by cfDNA sequencing performs equivalently in both populations. A statistically significant improvement in positive predictive value (PPV) over standard serum screening has been demonstrated (refs). The lower rate of false positive test results, compared with serum biochemistry and nuchal translucency measurement, has significantly reduced the need for invasive diagnostic procedures (see Larion et al. Abuhamad group references).
[00521] Given the good performance of NIPT in the general obstetric population, workflow simplicity and cost have now become major considerations for implementing cfDNA sequencing for the detection of whole chromosome aneuploidy in the general obstetric population (reference: ISPD Debate 1, Brisbane). Most NIPT laboratory methods use a polymerase chain reaction (PCR) amplification step after library preparation and single-ended sequencing that requires 10 to 20 million unique cfDNA fragments to obtain reasonable sensitivity for detecting aneuploidy. The complexity of the PCR-based workflow and the deeper sequencing requirements have limited the potential of the NIPT assay and resulted in increased costs.
[00522] Here it is demonstrated that high analytical sensitivity and specificity can be achieved with a simple library preparation that uses very low cfDNA input and does not require PCR amplification. The PCR-free method simplifies the workflow, improves turnaround time and eliminates biases that are inherent in PCR methods. The amplification-free workflow can be integrated with paired end sequencing to allow determination of the fragment length for each label and of the total fetal fraction in each sample. Since fetal cfDNA fragments are shorter than maternal fragments [ref Quake 2010, should also cite Lo's Science Clin Translation article], the detection of fetal aneuploidy in maternal plasma can be made much more robust and efficient, requiring fewer unique cfDNA fragments. In combination, improved analytical sensitivity and specificity are achieved with a very fast turnaround time on a significantly lower number of cfDNA fragments. This potentially allows NIPT to be performed at significantly lower cost, facilitating its application to the general obstetric population.
Methods
[00523] Peripheral blood samples were drawn into BCT tubes (Streck, Omaha, NE, USA) and sent to the Illumina CLIA laboratory in Redwood City for commercial NIPT testing. Signed patient consent forms allowed secondary plasma aliquots to be de-identified and used for clinical research, with the exception of patient samples sent from New York State. Plasma samples for this work were selected to include both unaffected and aneuploid fetuses with a range of cfDNA concentrations and fetal fractions.
Simplified Library Processing
[00524] cfDNA was extracted from 900 µl of maternal plasma using the NucleoSpin 96-well blood purification kit (Macherey-Nagel, Duren, Germany) with minor modifications to accommodate a larger lysate input. The isolated cfDNA was placed directly into the sequencing library process without any normalization of the cfDNA input. Sequencing libraries were prepared with a TruSeq PCR-Free DNA library kit (Illumina, San Diego, CA, USA) with dual indexes to barcode the cfDNA fragments for sample identification. The following modifications to the library protocol were used to improve the compatibility of the library preparation with the low concentration of input cfDNA. The standard input volume was increased, while the master mix and adapter concentrations for end repair, A-tailing and ligation were decreased. Additionally, after end repair, a heat inactivation step was introduced to deactivate the enzymes, the SPRI bead (supplier) purification step after end repair was removed, and the elution during the SPRI bead purification step after ligation used HT1 buffer (Illumina).
[00525] A single MICROLAB® STAR liquid handler (Hamilton, Reno, NV, USA), configured with a 96-channel head and eight 1 mL pipetting channels, was used to batch process 96 plasma samples at a time. The liquid handler processed each individual plasma sample through DNA extraction, sequencing library preparation and quantification. The individual sample libraries were quantified with AccuClear (Biotium, Hayward, CA, USA) and pools of 48 samples were prepared with normalized inputs, resulting in a final concentration of 32 pM for sequencing.
Paired end sequencing
[00526] DNA sequencing was performed on an Illumina NextSeq 500 instrument using 2x36 base pair paired end sequencing, plus an additional 16 cycles to sequence the sample barcodes. A total of 364 samples were run across 8 independent sequencing batches.
[00527] Paired DNA sequences were demultiplexed using bcl2fastq (Illumina) and mapped to the reference human genome (hg19) using the bowtie2 aligner algorithm [ref. Langmead]. Paired reads had to match forward and reverse strands to be counted. All counted mapped pairs exceeding a mapping quality score of 10 (Ruan et al.) with globally unique first reads were assigned to non-overlapping, consecutive, fixed-width genomic bins of 100 kb in size. Approximately 2% of the genome showed highly variable coverage across an independent set of NIPT samples and was excluded from further analysis.
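The binning step of paragraph [00527] can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout `(chromosome, start, mapq)`, the 0-based start coordinate and the function name are hypothetical, and the strand-matching, uniqueness and excluded-region filters mentioned in the text are not modeled.

```python
from collections import Counter

BIN_SIZE = 100_000  # fixed-width, non-overlapping 100 kb bins
MIN_MAPQ = 10       # pairs must exceed this mapping quality score


def count_pairs_per_bin(pairs):
    """Count mapped pairs per (chromosome, bin index).

    `pairs` is an iterable of (chromosome, start, mapq) tuples, with `start`
    a 0-based coordinate (an assumption of this sketch).
    """
    counts = Counter()
    for chromosome, start, mapq in pairs:
        if mapq > MIN_MAPQ:
            counts[(chromosome, start // BIN_SIZE)] += 1
    return counts


# Hypothetical pairs, for illustration only.
example_pairs = [
    ("chr1", 99_999, 30),   # last base of bin 0
    ("chr1", 100_000, 30),  # first base of bin 1
    ("chr1", 150_000, 10),  # dropped: mapping quality does not exceed 10
    ("chr2", 5, 42),
]
example_bins = count_pairs_per_bin(example_pairs)
```

Integer division by the bin width gives consecutive bins that cannot overlap, which is the property the paragraph requires.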
[00528] Using the genomic location and fragment size information available from the mapped locations of the two ends of each sequenced cfDNA fragment, two variables were derived for each 100 kb window: (a) the total count of short fragments below 150 base pairs in length and (b) the fraction of fragments between 80 and 150 base pairs within the set of all fragments below 250 base pairs. Limiting the fragment size to less than 150 base pairs enriches for fragments that originate from the placenta, which is a proxy for fetal DNA. The fraction of short fragments characterizes the relative amount of fetal cfDNA in the plasma mixture. The cfDNA of a trisomic fetus would be expected to have a higher fraction of short reads mapping to the trisomic chromosome than that of a euploid fetus that is disomic for this chromosome.
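The two per-bin variables of paragraph [00528] can be computed from the fragment lengths observed in one 100 kb window, as in the sketch below. The function name is hypothetical, and treating the 80 and 150 base pair limits of the fraction as inclusive is an assumption, since the text does not state how the boundaries are handled.

```python
def bin_features(fragment_lengths):
    """Return (short_count, short_fraction) for one genomic bin.

    short_count:    fragments shorter than 150 base pairs.
    short_fraction: fragments of 80 to 150 base pairs (inclusive, assumed)
                    divided by all fragments shorter than 250 base pairs;
                    None when the bin holds no fragment below 250 base pairs.
    """
    short_count = sum(1 for n in fragment_lengths if n < 150)
    in_window = sum(1 for n in fragment_lengths if 80 <= n <= 150)
    below_250 = sum(1 for n in fragment_lengths if n < 250)
    short_fraction = in_window / below_250 if below_250 else None
    return short_count, short_fraction


# Hypothetical fragment lengths, for illustration only.
example_features = bin_features([70, 90, 120, 149, 160, 200, 260])
```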
[00529] Short fragment counts and fractions were independently normalized to remove systematic assay biases and sample-specific variations attributable to genomic guanine-cytosine (GC) content, using the process shown in Figure 2D. The normalized values were trimmed by removing bins that deviate from the whole-chromosome mean by more than 3 robust measures of standard deviation. Finally, for each of the two variables, the trimmed normalized values associated with the target chromosome were compared with those on the normalizing reference chromosomes to construct a t-statistic.
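The trimming and t-statistic steps of paragraph [00529] can be sketched as below. Several choices here are assumptions of the sketch and not details given in the text: the robust standard deviation is taken as 1.4826 times the median absolute deviation, the bins are centered on the median rather than the mean so that the center is not pulled by the outliers being removed, and the t-statistic is Welch's unequal-variance form. The function names are hypothetical, and GC normalization is not modeled.

```python
import math
import statistics

MAD_TO_SD = 1.4826  # scales the median absolute deviation to a normal-equivalent SD


def trim_bins(values, n_sd=3.0):
    """Drop bins further than n_sd robust SDs from the chromosome-wide center.

    Assumption: the center is the median and the robust SD is the scaled MAD.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    limit = n_sd * MAD_TO_SD * mad
    return [v for v in values if abs(v - center) <= limit]


def welch_t(target, reference):
    """Welch's t-statistic comparing target bins with reference bins.

    Assumption: the text does not name the form of the t-statistic.
    Each input needs at least two values.
    """
    mean_diff = statistics.fmean(target) - statistics.fmean(reference)
    std_err = math.sqrt(
        statistics.variance(target) / len(target)
        + statistics.variance(reference) / len(reference)
    )
    return mean_diff / std_err


# Hypothetical normalized bin values, for illustration only.
example_trimmed = trim_bins([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0])
example_t = welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

One such statistic would be computed for the short fragment counts and another for the short fragment fractions.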
[00530] Data from each paired end sequencing run went through four analysis steps: 1) read conversion, 2) feature binning at 100 kb resolution, 3) normalization of each feature (counts and fraction) at 100 kb resolution and 4) combination of the features and scoring for aneuploidy detection. In step 1, the sample data are demultiplexed by their individual barcodes, aligned to the genome and filtered for sequence quality. In step 2, the total count of short fragments below 150 base pairs in length and the fraction of fragments between 80 and 150 base pairs within the set of all fragments below 250 base pairs are determined for each bin. Assay bias and sample-specific variations are removed in step 3. Finally, enrichment over a reference is determined and scored using a t-test for each of the counts and the fraction, and the two scores are combined into a final score for aneuploidy detection.
Detection of Fetal Whole Chromosome Aneuploidy
[00531] We tested whether the count and fraction data could be combined to enhance the ability to detect fetal trisomy 21. Sixteen plasma samples from pregnant women carrying fetuses with trisomy 21 and 294 karyotype-confirmed samples from unaffected pregnancies were randomly distributed across processing batches, resulting in nine flow cells for sequencing. Each algorithm step was examined separately to determine the ability of each step, and of each combination of steps, to detect aneuploidy. The final score for detecting fetal aneuploidy in the combined case was defined as the square root of the sum of the squares of the two individual t-statistics, and a single threshold was applied to generate a call of “aneuploidy detected” versus “aneuploidy not detected”.
Fetal Fraction Calculation
[00532] For each sample, the fetal fraction was estimated using the ratio of the total number of fragments of size [111, 136] base pairs to the total number of fragments of size [165, 175] base pairs within a subset of the 100 kb genomic bins. Using samples from women carrying known male fetuses, the top 10% of genomic bins with the highest correlation to the fetal fraction derived from the X chromosome copy number [ref Rava] were determined. The correlation between the fetal fraction estimates based on fragment size and those derived from the X chromosome in known male fetuses was computed using a “leave one out” cross-validation analysis [REF] that included both the bin selection and the estimation of the regression model parameters. The estimated fetal fraction was then derived from the fragment size ratios using a linear regression model.
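The combined score defined in paragraph [00531] is the Euclidean norm of the two t-statistics. A minimal sketch follows; the function names are hypothetical, and the threshold is a caller-supplied placeholder because the text does not state its value.

```python
import math


def combined_score(t_counts, t_fraction):
    """Square root of the sum of the squares of the two t-statistics."""
    return math.hypot(t_counts, t_fraction)


def call_aneuploidy(t_counts, t_fraction, threshold):
    """Apply a single threshold to the combined score.

    `threshold` is a placeholder argument: the text does not give its value.
    """
    if combined_score(t_counts, t_fraction) > threshold:
        return "aneuploidy detected"
    return "aneuploidy not detected"


# Hypothetical t-statistics, for illustration only.
example_score = combined_score(3.0, 4.0)
```

Because both terms are squared, the combined score is non-negative and does not retain the sign of either t-statistic.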
Results
Simplified Library Processing
[00533] Figure 8 shows the overall workflow and timeline for this new version of NIPT compared with the standard laboratory workflow. The entire 96-sample preparation workflow for plasma isolation, cfDNA extraction, library construction, quantification and pooling processed samples in less than 6 hours of total preparation time on a single Hamilton STAR. This compares with 9 hours and two Hamilton STARs for the PCR-based methods used in the CLIA laboratory. The amount of cfDNA extracted per sample averaged 60 pg/µL, and the sequencing library output yield was linearly correlated (R² = 0.94) with cfDNA input, as shown in Figure 9. The average recovery was greater than 70% (add range), indicating highly efficient cfDNA recovery after SPRI bead purification. Each sequencing run used normalized amounts of 48 multiplexed samples and took approximately 14 hours to complete. The median number of uniquely mapped paired reads was XXX M, with 95% of samples above YYY.
Paired End Sequencing
[00534] The total sequencing time per batch of 48 samples was less than 14 hours on the NextSeq 500. This compares with either 40 hours (1 flow cell, 96 samples) or 50 hours (2 flow cells, 192 samples) for the laboratory process on a HiSeq 2500. The genomic locations mapped at both ends of the cfDNA fragments provided cfDNA fragment size information. Figure 10 shows the cfDNA fragment size distribution as measured from 324 samples from pregnancies with a male fetus. The size of the fragments that mapped to autosomal chromosomes known to be euploid, which primarily represent maternal chromosomes, is represented by the thin curve. The average insert size was 175 base pairs, with XX% of fragments measuring between 100 base pairs and 200 base pairs. The thick curve represents the size of fragments that originate exclusively from the Y chromosome and therefore represent only fetal cfDNA fragments. The size distribution of the Y chromosome-specific sequences was smaller, averaging 167 base pairs, with a periodicity of 10 bases at the shorter fragment sizes.
[00535] Since the shorter cfDNA fragments are enriched for fetal DNA, selective analysis using only the shorter fragments would be expected to increase the relative fetal representation through preferential selection of fetal reads. Figure 11 shows the relative fetal fraction from total mapped paired end read counts compared with that from paired end read counts of fragments shorter than 150 base pairs. Overall, the median fetal fraction increases by a factor of 2 compared with the total counts, although with some increase in variance. The 150 base pair size cutoff was found to provide an optimal trade-off between the increase in fetal representation and the variance in counts.
Detection of Fetal Whole Chromosome Aneuploidy
[00536] Each of the available measures (total counts; counts of fragments shorter than 150 base pairs; the fraction of counts enriched for fetal cfDNA, that is, counts between 80 and 150 base pairs divided by counts below 250 base pairs; and the combination of the short fragment counts and the fraction) was tested for its ability to differentiate trisomy 21 samples from those euploid for chromosome 21. Figure 12 shows the results for each of these measures. Total counts have an average of XX counts, while counts of fragments shorter than 150 base pairs have an average of YY counts. Also, as can be seen in Fig. 4A and 4B, the short fragment counts show better separation between trisomy 21 and euploid samples, essentially because this measure is enriched for fetal cfDNA. The fraction alone is almost as effective as total counts in differentiating aneuploidy (Fig. 4C), but when used in combination with the short fragment counts (Fig. 4D) it provides improved differentiation over the short fragment counts alone. This indicates that the fraction provides independent information that enhances the detection of trisomy 21. Compared with the current CLIA laboratory workflow, which uses library preparation with PCR amplification and an average of 16 M counts/sample, the PCR-free, paired end sequencing workflow shows equivalent performance with significantly fewer counts/sample (e.g., 6 M counts/sample or less) and a shorter, simpler sample preparation workflow.
Fetal Fraction Calculation
[00537] Using the X chromosome results from pregnancies with male fetuses, normalized chromosome values can be used to determine fetal fractions from the counts (ClinChem ref) and compared across different cfDNA fragment sizes. The fetal fractions derived from the X chromosome were used to calibrate the ratios for a set of 140 samples and to estimate performance using a “leave one out” cross-validation.
Figure 13 shows the results of the cross-validated fetal fraction prognosis and demonstrates the correlation between the two data sets, indicating that fetal fraction estimates can be obtained from any samples, including those from women carrying female fetuses once a set of calibration was measured.
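The calibration step above can be illustrated with a short sketch. It computes the size fraction named in the text (counts between 80 and 150 base pairs divided by counts below 250 base pairs) and then runs a leave-one-out cross-validation of a calibration against X-chromosome-derived fetal fractions. The straight-line calibration model, the inclusive 80 to 150 boundaries, and all function names are assumptions made for illustration; the text does not state the form of the calibration.

```python
def size_fraction(lengths, lo=80, hi=150, upper=250):
    """Fraction of fragments with lo <= length <= hi among fragments < upper bp."""
    denom = sum(1 for n in lengths if n < upper)
    if denom == 0:
        raise ValueError("no fragments shorter than the upper bound")
    numer = sum(1 for n in lengths if lo <= n <= hi)
    return numer / denom


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_out(xs, ys):
    """Predict each y from a line fitted to all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


# Toy data: size fractions (x) and chrX-derived fetal fractions (y) for male-fetus samples.
fractions = [0.20, 0.25, 0.30, 0.35]
ff_chrx = [0.05, 0.08, 0.11, 0.14]
predicted = leave_one_out(fractions, ff_chrx)
```

Once a calibration of this kind has been fitted on male-fetus samples, the same mapping can be applied to the size fraction of any sample, which is what allows estimates for female-fetus pregnancies.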
Discussion [00538] It has been demonstrated that high analytical sensitivity and specificity for detecting fetal aneuploidy from cfDNA in maternal plasma can be achieved with PCR-free library preparation integrated with paired-end DNA sequencing. The method simplifies the workflow, improves turnaround time (Figure 8) and should eliminate some of the biases inherent in PCR methods. Paired-end sequencing allows determination of fragment sizes and fetal fraction, which can be used to further enhance the detection of aneuploidy at significantly lower sequence tag counts than currently implemented commercial methods. The performance of the PCR-free paired-end implementation appears to be similar to that of single-end sequencing methods that use up to three times the number of tags.
Simplifying Library Processing [00539] The PCR-free workflow has several advantages for the clinical laboratory. Because of the high yield and linear behavior of the library preparation, normalized sample pools for sequencing can be prepared directly from the individual sample library concentrations. Biases inherent in PCR amplification during library preparation are thus eliminated. In addition, there is no need to dedicate separate liquid handlers to pre- and post-PCR activities; this reduces the capital burden on the laboratory. This streamlined workflow allows batches of samples to be prepared within a single clinical laboratory shift and then sequenced and analyzed overnight. Overall, reduced capital expenditure, reduced hands-on time and rapid turnaround allow for potentially significant reductions in cost and improved overall robustness of NIPT.
Paired-End Sequencing [00540] Using paired-end sequencing on the NextSeq 500 system has several advantages for counting cfDNA fragments. First, with dual-index barcodes, samples can be multiplexed at a high level, allowing normalization and correction of run-to-run variation with high statistical confidence. In addition, because 48 samples are multiplexed per run and the amount needed on the flow cell for clustering is limited, the sample input requirement is significantly reduced, allowing the PCR-free library workflow to be used. With their typical cfDNA yield of approximately 5 ng per sample, researchers were able to obtain 2 to 3 sequencing runs per sample even without PCR amplification. This is unlike other methods that require significant plasma input from multiple blood tubes to produce enough cfDNA to determine aneuploidy (REF). Finally, paired-end sequencing allows determination of cfDNA fragment size and analytical enrichment for fetal cfDNA.
Detection of Fetal Whole-Chromosome Aneuploidy [00541] Our results demonstrate that counts of cfDNA fragments below 150 base pairs differentiate aneuploid from euploid chromosomes better than total counts. This observation contrasts with the results of Fan et al., who suggest that the accuracy of the counting statistics would be decreased by using shorter fragments (Fan et al.) because of the reduction in the number of counts available. The fraction of short fragments also provides some differentiation for detecting trisomy 21, as implied by Yu et al., although with less dynamic range than the counts. However, combining the count and fraction measures results in better separation of the trisomy 21 samples from the euploid samples and implies that these two measures are complementary measurements of chromosome representation. Other biological measures, for example methylation, could also provide orthogonal information that could enhance the signal-to-noise ratio for detecting aneuploidy.
Calculation of Fetal Fraction [00542] The methods presented here also allow the fetal fraction in each sample to be estimated without additional laboratory work. With many samples on each flow cell, approximately half of which are from pregnancies with male fetuses, an accurate fetal fraction estimate can be obtained for all samples by calibrating the fetal fraction measurement derived from the fragment size information against that determined from the male samples. In the commercial environment, the clinical experience of the researchers showed that standard counting methods using a greater number of single-end tags led to very low false negative rates even in the absence of specific fetal fraction measurements (REF). Given the similar detection limit observed here, equivalent test performance is expected.
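One way to put the count-based measures discussed above on a common scale is a two-sample t-statistic that compares bin coverages on the chromosome of interest with bin coverages in a reference region, computed once for short fragments and once for all fragments. The sketch below assumes the unequal-variance (Welch) form of the statistic; the function name and the toy coverage values are illustrative and are not taken from this description.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(interest, reference):
    """Welch-style t-statistic between two lists of normalized bin coverages."""
    n1, n2 = len(interest), len(reference)
    if n1 < 2 or n2 < 2:
        raise ValueError("each region needs at least two bins")
    # Sample variances (n - 1 denominator) scaled by the number of bins.
    se = sqrt(variance(interest) / n1 + variance(reference) / n2)
    if se == 0:
        raise ValueError("coverages have zero variance")
    return (mean(interest) - mean(reference)) / se


# Toy normalized coverages for bins on the chromosome of interest and in a reference region.
t_short = t_statistic([1.1, 1.0, 1.2, 1.1], [1.0, 0.9, 1.1, 1.0])  # short fragments
t_total = t_statistic([1.04, 1.0, 1.06, 1.02], [1.0, 0.98, 1.02, 1.0])  # all fragments
```

Computing the statistic separately for each size domain keeps the two measures comparable, so they can later be combined in a single decision rule.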
Conclusion [00543] It has been demonstrated that high analytical sensitivity and specificity for detecting fetal aneuploidy from cfDNA in maternal plasma can be achieved with PCR-free library preparation integrated with paired-end DNA sequencing. This streamlined workflow has a very fast turnaround time, potentially allowing NIPT to be performed at a significantly lower cost for use in the general obstetric population. In addition, paired-end sequencing techniques have the potential to measure other biological phenomena, as well as to support other clinical applications. For example, size information from specific methylated regions of the genome or CpG islands could provide another orthogonal measure to enhance the detection of copy number variants across the genome.
[00544] The present disclosure can be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the disclosure is therefore indicated by the appended claims rather than by the preceding description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (33):
[1]
1. Method, implemented using a computer system comprising one or more processors and a memory system, for determining a copy number variation (CNV) of a nucleic acid sequence of interest in a test sample comprising cell-free nucleic acid fragments that originate from two or more genomes, the method characterized by the fact that it comprises: (a) receiving, through the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning, through the one or more processors, the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome comprising the sequence of interest, thereby providing test sequence tags, in which the reference genome is divided into a plurality of bins; (c) determining fragment sizes of at least some of the cell-free nucleic acid fragments present in the test sample; (d) for cell-free nucleic acid fragments determined to be in a first size domain, calculating, through the one or more processors, first coverages of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags that align to the bin, and (ii) normalizing the number of sequence tags that align to the bin by accounting for bin-to-bin variations due to factors other than copy number variation; (e) for cell-free nucleic acid fragments determined to be in a second size domain, calculating, through the one or more processors, second coverages of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags that align to the bin, and (ii) normalizing the number of sequence tags that align to the bin by accounting for bin-to-bin variations due to factors other than copy number variation; and (f) determining a copy number variation in the sequence of interest using a probability ratio calculated from the first coverages and the second coverages.
[2]
2. Method according to claim 1, characterized by the fact that the probability ratio is calculated from a t-statistic of the first coverages and a t-statistic of the second coverages, in which each t-statistic is calculated using coverages of bins in the sequence of interest and coverages of bins in a reference region for the sequence of interest.
[3]
3. Method according to claim 1, characterized in that the first size domain comprises cell-free nucleic acid fragments of substantially all sizes in the test sample and the second size domain comprises only cell-free nucleic acid fragments smaller than a defined size.
[4]
4. Method according to claim 1, characterized in that the second size domain comprises only cell-free nucleic acid fragments smaller than about 150 base pairs.
[5]
5. Method according to claim 1, characterized in that the probability ratio is calculated as a first probability that the test sample is an aneuploid sample over a second probability that the test sample is a euploid sample.
[6]
6. Method according to claim 1, characterized by the fact that the probability ratio is calculated from one or more fetal fraction values in addition to the first coverages and the second coverages.
[7]
7. Method according to claim 6, characterized in that the one or more fetal fraction values comprise a fetal fraction value calculated using information about cell-free nucleic acid fragment sizes.
[8]
8. Method according to claim 7, characterized by the fact that the fetal fraction value is calculated by: obtaining a frequency distribution of the cell-free nucleic acid fragment sizes; and applying the frequency distribution to a model relating fetal fraction to fragment size frequency to obtain the fetal fraction value.
[9]
9. Method according to claim 6, characterized in that the one or more fetal fraction values comprise a fetal fraction value calculated using coverage information for the reference genome bins.
[10]
10. Method according to claim 9, characterized by the fact that the fetal fraction value is calculated by: applying coverage values of a plurality of bins to a model relating fetal fraction to bin coverage to obtain the fetal fraction value.
[11]
11. Method according to claim 6, characterized in that the one or more fetal fraction values comprise a fetal fraction value calculated using coverage information for the bins of a sex chromosome.
[12]
12. Method according to claim 6, characterized by the fact that the probability ratio is calculated from a fetal fraction, a t-statistic of short fragments and a t-statistic of all fragments, in which the short fragments are cell-free nucleic acid fragments in a first size range smaller than a criterion size and all fragments are cell-free nucleic acid fragments including the short fragments and fragments longer than the criterion size.
[13]
13. Method according to claim 12, characterized by the fact that the probability ratio is calculated as: [equation], where p1 represents the probability that the data originate from a multivariate normal distribution representing a 3-copy or 1-copy model, p0 represents the probability that the data originate from a multivariate normal distribution representing a 2-copy model, T_short and T_total are t-statistics calculated from the chromosomal coverage generated from short fragments and from all fragments, and q(ff_total) is a density distribution of the fetal fraction.
[14]
14. Method according to claim 1, characterized by the fact that the probability ratio is calculated for monosomy X, trisomy X, trisomy 13, trisomy 18 or trisomy 21.
[15]
15. Method according to claim 1, characterized by the fact that normalizing the number of sequence tags comprises: normalizing for GC content of the test sample, normalizing for a global wave profile of variation of a training set, and/or normalizing for one or more components obtained from a principal component analysis.
[16]
16. Method according to claim 2, characterized by the fact that the reference region is selected from the group consisting of: all robust chromosomes, robust chromosomes not including the sequence of interest, at least one chromosome outside the sequence of interest and a subset of chromosomes selected from the robust chromosomes, where the robust chromosomes are autosomal chromosomes other than chromosomes 13, 18 and 21.
[17]
17. Method according to claim 16, characterized by the fact that the reference region comprises robust chromosomes that have been determined to provide the best signal detection ability for a set of training samples.
[18]
18. Method according to claim 2, characterized by the fact that it further comprises: calculating values of a size parameter for the bins by, for each bin: (i) determining a value of the size parameter from the sizes of cell-free nucleic acid fragments in the bin, and (ii) normalizing the value of the size parameter by accounting for bin-to-bin variations due to factors other than copy number variation; and determining a size-based t-statistic for the sequence of interest using size parameter values of bins in the sequence of interest and size parameter values of bins in the reference region for the sequence of interest.
[19]
19. Method according to claim 18, characterized by the fact that the probability ratio of (f) is calculated from the first t statistic, the second t statistic and the t statistic based on size.
[20]
20. Method according to claim 18, characterized in that the probability ratio of (f) is calculated from the t-statistic based on size and a fetal fraction.
[21]
21. Method according to claim 1, characterized in that it further comprises comparing the probability ratio to a calling criterion to determine a variation in the number of copies in the sequence of interest.
[22]
22. Method according to claim 1, characterized in that it further comprises obtaining a plurality of probability ratios and applying the plurality of probability ratios to a decision tree to determine a ploidy case for the test sample.
[23]
23. System for evaluating the copy number of a nucleic acid sequence of interest in a test sample, the system characterized by the fact that it comprises: a sequencer for receiving cell-free nucleic acid fragments from the test sample and providing nucleic acid sequence information of the test sample; a processor; and one or more computer-readable storage media having stored thereon instructions for execution on said processor to: (a) receive sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) align the sequence reads of the cell-free nucleic acid fragments, or align fragments containing the sequence reads, to bins of a reference genome comprising the sequence of interest, thereby providing test sequence tags, in which the reference genome is divided into a plurality of bins; (c) determine fragment sizes of at least some of the cell-free nucleic acid fragments present in the test sample; (d) for cell-free nucleic acid fragments determined to be in a first size domain, calculate first coverages of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags that align to the bin, and (ii) normalizing the number of sequence tags that align to the bin by accounting for bin-to-bin variations due to factors other than copy number variation; (e) for cell-free nucleic acid fragments determined to be in a second size domain, calculate second coverages of the sequence tags for the bins of the reference genome by, for each bin: (i) determining a number of sequence tags that align to the bin, and (ii) normalizing the number of sequence tags that align to the bin by accounting for bin-to-bin variations due to factors other than copy number variation; and (f) determine a copy number variation in the sequence of interest using a probability ratio calculated from the first coverages and the second coverages.
[24]
24. Method for determining a copy number variation (CNV) of a nucleic acid sequence of interest in a test sample comprising cell-free nucleic acid fragments that originate from two or more genomes, the method characterized by the fact that it comprises: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments, or aligning fragments containing the sequence reads, to bins of a reference genome comprising the sequence of interest, thereby providing test sequence tags, in which the reference genome is divided into a plurality of bins; (c) determining fragment sizes of the cell-free nucleic acid fragments in the test sample; (d) calculating coverages of the sequence tags for the bins of the reference genome using sequence tags for cell-free nucleic acid fragments having sizes in a first size domain; (e) calculating coverages of the sequence tags for the bins of the reference genome using sequence tags for cell-free nucleic acid fragments having sizes in a second size domain, in which the second size domain is different from the first size domain; (f) calculating size characteristics for the bins of the reference genome using the fragment sizes determined in (c); and (g) determining a copy number variation in the sequence of interest using the coverages calculated in (d) and (e) and the size characteristics calculated in (f).
[25]
25. Method according to claim 24, characterized in that (g) comprises calculating a t-statistic for the sequence of interest using the size characteristics of bins in the sequence of interest calculated in (f).
[26]
26. Method according to claim 2, characterized by the fact that the t-statistic is calculated as follows: $t = \dfrac{x_1 - x_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, where: $x_1$ is a bin coverage of the sequence of interest, $x_2$ is a bin coverage of the reference region, $s_1$ is a standard deviation of the coverage of the bins in the sequence of interest, $s_2$ is a standard deviation of the coverage of the bins in the reference region, $n_1$ is a number of bins in the sequence of interest, and $n_2$ is a number of bins in the reference region.
[27]
27. Method according to claim 1, characterized in that it further comprises, before (a), extracting the cell-free nucleic acid fragments in the test sample from a plasma sample of a pregnant female carrying a fetus, in which the cell-free nucleic acid fragments in the test sample comprise nucleic acid from the fetus and nucleic acid from the pregnant female, and sequencing the cell-free nucleic acid fragments to obtain the sequence reads.
[28]
28. Method according to claim 27, characterized by the fact that it further comprises: determining that the fetus is affected by a genetic abnormality associated with the variation in the number of copies in the sequence of interest.
[29]
29. Method according to claim 28, characterized by the fact that it further comprises: prescribing, initiating and / or changing a treatment regimen, in which the treatment regimen is designed to treat genetic abnormalities that affect the fetus.
[30]
30. Method according to claim 1, characterized in that it further comprises, before (a), extracting the cell-free nucleic acid fragments in the test sample from an individual, in which the cell-free nucleic acid fragments comprise nucleic acid from cancer cells; and sequencing the cell-free nucleic acid fragments to obtain the sequence reads.
[31]
31. Method according to claim 30, characterized by the fact that it further comprises: determining that the individual is affected by a cancer associated with the variation in the number of copies in the sequence of interest.
[32]
32. Method according to claim 31, characterized by the fact that it further comprises: prescribing, initiating and / or changing a treatment regimen, in which the treatment regimen is designed to treat cancer that affects the individual.
[33]
33. The method of claim 30, characterized in that the cell-free nucleic acid fragments in the test sample are extracted from a plasma sample from an individual.
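As a rough illustration of how a probability ratio of the kind recited in the claims might combine two t-statistics with a fetal fraction distribution, the sketch below compares a 2-copy model with a 3-copy model using bivariate normal densities and averages the 3-copy density over a discrete fetal fraction distribution. Everything specific here is an assumption made for illustration: the unit variances, the correlation `rho`, the linear shift of the expected t-statistics with fetal fraction (`gain_short`, `gain_total`), and all function names. It is not the formula of the claims.

```python
from math import exp, pi, sqrt


def bivariate_normal_pdf(x, y, mx, my, sx, sy, rho):
    """Density of a bivariate normal with means, standard deviations and correlation rho."""
    zx, zy = (x - mx) / sx, (y - my) / sy
    one_minus = 1 - rho * rho
    exponent = -(zx * zx - 2 * rho * zx * zy + zy * zy) / (2 * one_minus)
    return exp(exponent) / (2 * pi * sx * sy * sqrt(one_minus))


def probability_ratio(t_short, t_total, ff_density, gain_short, gain_total, rho=0.5):
    """Ratio of the 3-copy model density to the 2-copy model density.

    ff_density: dict mapping fetal fraction -> weight (weights sum to 1).
    gain_short, gain_total: hypothetical expected shift of each t-statistic
    per unit fetal fraction under the 3-copy model.
    """
    # 2-copy (euploid) model: both t-statistics centered on zero.
    p0 = bivariate_normal_pdf(t_short, t_total, 0.0, 0.0, 1.0, 1.0, rho)
    # 3-copy model: means shift with fetal fraction; average over its distribution.
    p1 = sum(
        weight * bivariate_normal_pdf(
            t_short, t_total, gain_short * ff, gain_total * ff, 1.0, 1.0, rho
        )
        for ff, weight in ff_density.items()
    )
    return p1 / p0  # p0 can underflow to zero for extreme t-statistics


# Toy call: a single fetal fraction of 10% with assumed gains of 50 and 40.
ratio = probability_ratio(5.0, 4.0, {0.10: 1.0}, gain_short=50.0, gain_total=40.0)
```

A ratio well above 1 favors the 3-copy model and a ratio well below 1 favors the 2-copy model; in practice the ratio would be compared against a calling criterion.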

Similar technologies:

Publication number | Publication date | Patent title

US20210371907A1|2021-12-02|Using cell-free dna fragment size to determine copy number variations

US10095831B2|2018-10-09|Using cell-free DNA fragment size to determine copy number variations

US10658070B2|2020-05-19|Resolving genome fractions using polymorphism counts

NZ752319B2|2021-04-30|Using cell-free dna fragment size to determine copy number variations

CN114181997A|2022-03-15|Determination of copy number variation using cell-free DNA fragment size

Patent family:

Publication number | Publication date

CN108884491B|2021-04-27|

KR20180123020A|2018-11-14|

TWI708848B|2020-11-01|

KR102049191B1|2019-11-26|

TWI661049B|2019-06-01|

CN113096726A|2021-07-09|

DK3202915T3|2019-06-24|

NZ752319A|2021-01-29|

EA035148B1|2020-05-06|

EA201891580A1|2019-01-31|

KR20190132558A|2019-11-27|

EA202090277A2|2020-07-31|

EA202090277A3|2020-10-30|

WO2017136059A1|2017-08-10|

IL260938A|2020-03-31|

IL272710D0|2020-04-30|

CN108884491A|2018-11-23|

US20190065676A1|2019-02-28|

KR102184868B1|2020-12-02|

AU2019203491A1|2019-06-06|

US10095831B2|2018-10-09|

AU2016391100A1|2018-09-27|

AU2019203491B2|2021-05-27|

NZ745637A|2019-05-31|

MA44822A|2017-08-09|

US20170220735A1|2017-08-03|

IL272710A|2021-05-31|

CA3013572A1|2017-08-10|

MA52131A|2019-07-31|

ZA201805753B|2019-04-24|

AU2016391100B2|2019-03-07|

AR107192A1|2018-03-28|

EP3517626A1|2019-07-31|

BR112018015913A2|2019-01-22|

EP3202915A1|2017-08-09|

SG11201806595UA|2018-09-27|

EP3202915B1|2019-03-20|

TW201805429A|2018-02-16|

TW201930598A|2019-08-01|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Patent title

US20100216153A1|2004-02-27|2010-08-26|Helicos Biosciences Corporation|Methods for detecting fetal nucleic acids and diagnosing fetal abnormalities|

WO2007145612A1|2005-06-06|2007-12-21|454 Life Sciences Corporation|Paired end sequencing|

CA2668818C|2006-10-10|2018-06-26|Xenomics, Inc.|Compositions, methods and kits for isolating nucleic acids from body fluids using anion exchange media|

US8262900B2|2006-12-14|2012-09-11|Life Technologies Corporation|Methods and apparatus for measuring analytes using large scale FET arrays|

MX2010003724A|2007-10-04|2010-09-14|Halcyon Molecular|Sequencing nucleic acid polymers with electron microscopy.|

WO2009051842A2|2007-10-18|2009-04-23|The Johns Hopkins University|Detection of cancer by measuring genomic copy number and strand length in cell-free dna|

US20100261183A1|2007-11-01|2010-10-14|Adam Marcus Shlien|Method of determining risk for cancer|

JP5770737B2|2009-11-06|2015-08-26|ザチャイニーズユニバーシティオブホンコン|Genome analysis based on size|

EP2875149B1|2012-07-20|2019-12-04|Verinata Health, Inc.|Detecting and classifying copy number variation in a cancer genome|

US9411937B2|2011-04-15|2016-08-09|Verinata Health, Inc.|Detecting and classifying copy number variation|

US9323888B2|2010-01-19|2016-04-26|Verinata Health, Inc.|Detecting and classifying copy number variation|

US9260745B2|2010-01-19|2016-02-16|Verinata Health, Inc.|Detecting and classifying copy number variation|

EP2591433A4|2010-07-06|2017-05-17|Life Technologies Corporation|Systems and methods to detect copy number variation|

US9029103B2|2010-08-27|2015-05-12|Illumina Cambridge Limited|Methods for sequencing polynucleotides|

US8725422B2|2010-10-13|2014-05-13|Complete Genomics, Inc.|Methods for estimating genome-wide copy number variations|

WO2012162884A1|2011-05-31|2012-12-06|北京贝瑞和康生物技术有限公司|Kits, devices and methods for detecting chromosome copy number of embryo or tumor|

JP5659319B2|2011-06-29|2015-01-28|ビージーアイヘルスサービスカンパニーリミテッド|Non-invasive detection of genetic abnormalities in the fetus|

AU2011373694A1|2011-07-26|2013-05-02|Verinata Health, Inc.|Method for determining the presence or absence of different aneuploidies in a sample|

US9367663B2|2011-10-06|2016-06-14|Sequenom, Inc.|Methods and processes for non-invasive assessment of genetic variations|

EP2805280A1|2012-01-20|2014-11-26|Sequenom, Inc.|Diagnostic processes that factor experimental conditions|

JP5993029B2|2011-12-31|2016-09-14|ビージーアイダイアグノーシスカンパニーリミテッドＢｇｉＤｉａｇｎｏｓｉｓＣｏ．，Ｌｔｄ．|Detection method of gene mutation|

US20130150253A1|2012-01-20|2013-06-13|Sequenom, Inc.|Diagnostic processes that factor experimental conditions|

US9892230B2|2012-03-08|2018-02-13|The Chinese University Of Hong Kong|Size-based analysis of fetal or tumor DNA fraction in plasma|

US9920361B2|2012-05-21|2018-03-20|Sequenom, Inc.|Methods and compositions for analyzing nucleic acid|

US11261494B2|2012-06-21|2022-03-01|The Chinese University Of Hong Kong|Method of measuring a fractional concentration of tumor DNA|

PT2893040T|2012-09-04|2019-04-01|Guardant Health Inc|Systems and methods to detect rare mutations and copy number variation|

GB2528205B|2013-03-15|2020-06-03|Guardant Health Inc|Systems and methods to detect rare mutations and copy number variation|

US10504613B2|2012-12-20|2019-12-10|Sequenom, Inc.|Methods and processes for non-invasive assessment of genetic variations|

AU2014281635B2|2013-06-17|2020-05-28|Verinata Health, Inc.|Method for determining copy number variations in sex chromosomes|

JP6534191B2|2013-10-21|2019-06-26|ベリナタヘルスインコーポレイテッド|Method for improving the sensitivity of detection in determining copy number variation|

US10415083B2|2013-10-28|2019-09-17|The Translational Genomics Research Institute|Long insert-based whole genome sequencing|

WO2015184404A1|2014-05-30|2015-12-03|Verinata Health, Inc.|Detecting fetal sub-chromosomal aneuploidies and copy number variations|

MA40939A|2014-12-12|2017-10-18|Verinata Health Inc|USING THE SIZE OF ACELLULAR DNA FRAGMENTS TO DETERMINE VARIATIONS IN THE NUMBER OF COPIES|

US10364467B2|2015-01-13|2019-07-30|The Chinese University Of Hong Kong|Using size and number aberrations in plasma DNA for detecting cancer|

US10095831B2|2016-02-03|2018-10-09|Verinata Health, Inc.|Using cell-free DNA fragment size to determine copy number variations|

JP6534191B2|2013-10-21|2019-06-26|ベリナタヘルスインコーポレイテッド|Method for improving the sensitivity of detection in determining copy number variation|

WO2015184404A1|2014-05-30|2015-12-03|Verinata Health, Inc.|Detecting fetal sub-chromosomal aneuploidies and copy number variations|

MA40939A|2014-12-12|2017-10-18|Verinata Health Inc|USING THE SIZE OF ACELLULAR DNA FRAGMENTS TO DETERMINE VARIATIONS IN THE NUMBER OF COPIES|

US10095831B2|2016-02-03|2018-10-09|Verinata Health, Inc.|Using cell-free DNA fragment size to determine copy number variations|

WO2018031739A1|2016-08-10|2018-02-15|New York Genome Center, Inc.|Ultra-low coverage genome sequencing and uses thereof|

GB201718620D0|2017-11-10|2017-12-27|Premaitha Ltd|Method of detecting a fetal chromosomal abnormality|

EP3765633A4|2018-03-13|2021-12-01|Grail, Inc.|Method and system for selecting, managing, and analyzing data of high dimensionality|

US20190295684A1|2018-03-22|2019-09-26|The Regents Of The University Of Michigan|Method and apparatus for analysis of chromatin interaction data|

EP3773534A4|2018-03-30|2021-12-29|Juno Diagnostics, Inc.|Deep learning-based methods, devices, and systems for prenatal testing|

JP6974504B2|2018-04-02|2021-12-01|イルミナインコーポレイテッド|Compositions and Methods for Making Controls for Sequence-Based Genetic Testing|

EP3776555A2|2018-04-13|2021-02-17|Grail, Inc.|Multi-assay prediction model for cancer detection|

JP6891150B2|2018-08-31|2021-06-18|シスメックス株式会社|Analysis method, information processing device, gene analysis system, program, recording medium|

CN113195741A|2018-12-21|2021-07-30|豪夫迈·罗氏有限公司|Identification of global sequence features in whole genome sequence data from circulating nucleic acids|

KR102287096B1|2019-01-04|2021-08-09|테라젠지놈케어 주식회사|Method for determining fetal fraction in maternal sample|

US20210366569A1|2019-06-03|2021-11-25|Illumina, Inc.|Limit of detection based quality control metric|

CN110373477B|2019-07-23|2021-05-07|华中农业大学|Molecular marker cloned from CNV fragment and related to porcine ear shape character|

US20210102262A1|2019-09-23|2021-04-08|Grail, Inc.|Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data|

WO2021121368A1|2019-12-18|2021-06-24|The Chinese University Of Hong Kong|Cell-free dna fragmentation and nucleases|

CN111028890B|2019-12-31|2020-09-11|东莞博奥木华基因科技有限公司|CNV detection method based on correction between run|

US20210265007A1|2020-02-05|2021-08-26|The Chinese University Of Hong Kong|Molecular analyses using long cell-free fragments in pregnancy|

CN111477275B|2020-04-02|2020-12-25|上海之江生物科技股份有限公司|Method and device for identifying multi-copy area in microorganism target fragment and application|

CN112766428B|2021-04-08|2021-07-02|臻和生物科技有限公司|Tumor molecule typing method and device, terminal device and readable storage medium|

Legal status:
2019-05-28| B15N| Others concerning applications: notification of judicial decision|Free format text: COURT: 25TH FEDERAL COURT OF RIO DE JANEIRO. CASE NO. 5022017-09.2019.4.02.5101 - NUP: 00408.027367/2019-14. PETITIONER: KASZNAR LEONARDOS ADVOGADOS. RESPONDENT: PRESIDENT OF THE INSTITUTO NACIONAL DA PROPRIEDADE INDUSTRIAL - INPI. "IN VIEW OF THE FOREGOING, I DENY THE WRIT OF MANDAMUS AND DISMISS THE CASE WITHOUT RESOLUTION OF THE MERITS, PURSUANT TO ARTICLE 6, PARAGRAPH 5, OF LAW NO. 12.016/2009 IN CONJUNCTION WITH ARTICLE 485, VI OF THE CODE OF CIVIL PROCEDURE." |

2019-10-08| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|

2019-12-03| B16A| Patent or certificate of addition of invention granted [chapter 16.1 patent gazette]|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 20/12/2016, SUBJECT TO THE LEGAL CONDITIONS. |

Priority:

Application number | Filing date | Patent title

US201662290891P| true| 2016-02-03|2016-02-03|

US15/382,508|US10095831B2|2016-02-03|2016-12-16|Using cell-free DNA fragment size to determine copy number variations|

PCT/US2016/067886|WO2017136059A1|2016-02-03|2016-12-20|Using cell-free dna fragment size to determine copy number variations|
